Preface


Machine  learning  is  a  study  of  computational  methods  for  acquiring  knowledge  and 
improving  problem  solving  ability.  Because  of  the  breadth  of  this  charter,  machine  learning 
includes  a  wide  range  of  topics.  This  volume  collects  research  results  from  twelve  areas  of 
machine  learning  which  were  represented  at  the  Seventh  International  Conference  on 
Machine  Learning,  held  June  21-23, 1990  at  the  University  of  Texas  in  Austin. 

The 165 technical papers submitted to the conference provide evidence that machine
learning continues to mature and evolve. New areas of active research, such as robot
learning, have emerged, presenting challenging new problems and applications. Furthermore,
many papers described research that cuts across traditional boundaries in machine learning
and synthesizes disparate results.

Tom Mitchell, Doug Lenat, Lenny Pitt, and Don Michie were invited to discuss their
ongoing projects at the conference. Their sometimes unconventional views provided
controversy and perspective.

The success of the conference is due to the concern and dedication of all those involved.
In particular, the twenty-four members of the program committee earned the praise of the
authors because of their comprehensive and insightful reviews. The conference staff (John
Haradon, Yvonne Van Olphen, and Bess Sullivan) provided invaluable organizational skills.
Finally, Shirley Jowell of Morgan Kaufmann Publishers produced this excellent proceedings.

The conference was funded in part by generous contributions from the Office of Naval
Research, the Defense Advanced Research Projects Agency, and the National Science
Foundation. Their long-term support for machine learning research has been critical to the suc-
EMPIRICAL LEARNING


Arunkumar and Yegneshwar


Knowledge  Acquisition  from  Examples 
using  Maximal  Representation  Learning 
Abstract

In this paper we describe a new Knowledge Acquisition technique based on the
learning from examples paradigm. This technique uses the principle of maximal
representation of attributes, which clusters the examples into sub-descriptions
using the frequencies, and orders the attributes based on how well they represent
the class. The sub-descriptions aid in simple definitions of importance and
redundancy of attributes for a class, which are then used to discard unimportant
attributes and thereby simplify the class description. The resultant description
is the simplest characteristic description which describes the class completely
with respect to the learning examples and the attributes specified. This helps
not only in describing the class well but also in potentially discriminating the
current class from all other classes envisaged.

An inference algorithm based on the frequencies of the sub-descriptions and
importance of attributes is used to classify new examples.

This  system  has  been  tested  on  two  real-life 
applications,  the  results  of  which  are  highly 
encouraging. 

1  Introduction 

Knowledge acquisition is the process of acquiring knowledge to create expertise
for the solution of problems in a particular field of application. This is one of
the main bottlenecks in the development of Expert Systems [Duda 83], and there
have been many attempts to automate the process using learning techniques
[Quin 86] [Mich 80] [Lee 86] [Paw 84] [Fu 85]. However, the emphasis of most of
these systems is on learning a discriminative description of the class
(represented as decision trees [Quin 80][Lee 86][Craw 89] or as clusters
[Mich 83]) which helps in performance, but does not generate an intelligible
knowledge base. This is because the aim of discrimination is to give a
description which deterministically separates examples of one class from
examples of other classes, rather than to find a complete description of the
given class based on the set of attributes specified. For instance, the two
concepts chair and stool can be discriminated using the attribute backrest,
though this attribute is not sufficient to describe either of them. Our emphasis
here is on learning the characteristic description of the class. There is
evidence that humans too prefer characteristic descriptions, as highlighted in
[Med 87].

* This work is part of the Intelligent Systems Project at IIT, Bombay, India.

A domain independent system which learns a characteristic description of the
class and determines the redundancy of attributes was proposed by the first
author and first reported in [Doct 85]. The system described here is a modified
version of the above work which, in addition to having a modified learning
algorithm and definition of redundancy, also learns the importance of attributes
for a class [Yeg 87] [Arun 89]. In this paper, we restrict ourselves to examples
specified by binary valued attributes.

In the following section we explain the proposed learning of the class
description from examples for binary valued attributes and prove its
convergence. In section 3 we explain how redundancy of attributes for a class is
determined. The importance of an attribute is defined in section 4. In section 5
we explain the process of inference based on category validity [Med 87]. We
highlight the two applications in section 6, and conclude the paper in section 7.

2  Learning  Class  Description  from 
Examples 

The system learns descriptions of the classes for which examples have been
given. Each learning example is represented as a vector or as (attribute, value)
pairs. For instance, an example of a stool may be represented as (number of
legs, 4) (height, short) (shape of top, circular). The class description is
represented as a binary tree called the Generation Tree (GT).

2.1  The  Generation  Tree 

The Generation Tree is learnt using the principle of maximal representation,
which states that at each node of the tree we select that attribute whose most
representative value has the highest cardinality. This learning criterion
ensures that relatively more important attributes are selected earlier in the
tree and the less important ones lower down. The Generation Tree is constructed
top-down from the root to the leaves.

2.2  Algorithm  for  construction  of  GT 

The GT is built using the criterion of maximal representation at each node.

1.  Consider the given set of examples at the root node. Select the most
representative attribute among the examples at the current node. This is done
as follows:

(a)  If both the binary values occur for the given attribute among the set of
examples at the current node, then find the cardinality of both the sets, where
all the examples in the first set have one value (say 0) for this attribute and
all the examples in the other set have the complementary value (value 1) for
this attribute. Else, find the cardinality of the given set.

(b)  Find the maximum of the cardinality of the resulting sets (or set).

(c)  Repeat steps (a) and (b) for all the attributes and select as most
representative that attribute which has the maximum cardinality as determined
in step (b).

2.  For both the values (0 and 1) branch off from the current node. At the left
child node consider all those examples having value 0 for further branching,
and at the right child consider all those examples having value 1 for further
branching.

3.  Termination Condition: If all the attributes have been selected at some
node along all the paths of the tree, or if the frequency along all the leaves
is less than a specified threshold called the expansion threshold, then stop.
Else, go to step 2 and proceed in a depth first fashion.

Example  2.1  Given  the  following  set  of  examples 
specified  by  three  attributes  and  represented  in  vector 
form,  and  the  expansion  threshold  equal  to  0. 

S = {(101)(011)(101)(111)(101)(011)(111)(101)(111)(101)}

We will show how the GT is built for this example set. We have,

$$\{(|a_1 = 1| = 8)\ (|a_1 = 0| = 2)\ (|a_2 = 1| = 5)\ (|a_2 = 0| = 5)\ (|a_3 = 1| = 10)\ (|a_3 = 0| = 0)\}$$

where $|\cdot|$ stands for cardinality.

Since $a_3$ satisfies the maximal representation learning criterion, at the root
node we select $a_3$. Then at the next node we select $a_1$ and proceeding
similarly, we get the GT described in figure 1.
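The construction steps of Section 2.2 can be sketched in Python and run on the example set S above. This is only an illustrative sketch, not the authors' Pascal implementation: the function names, the nested-dictionary tree encoding, the tie-breaking rule and the per-branch reading of the expansion threshold are all assumptions made for this example.

```python
from collections import Counter

ATTRS = ["a1", "a2", "a3"]

def most_representative(examples, attrs):
    """Pick the attribute whose most frequent value covers the most examples."""
    best_attr, best_count = None, -1
    for a in attrs:
        count = max(Counter(e[a] for e in examples).values())
        if count > best_count:  # ties go to the earlier attribute (assumption)
            best_attr, best_count = a, count
    return best_attr

def build_gt(examples, attrs, total, threshold=0.0):
    """Return {"attr": name, "arcs": {value: (joint_frequency, subtree_or_None)}}."""
    if not attrs:
        return None
    attr = most_representative(examples, attrs)
    remaining = [a for a in attrs if a != attr]
    arcs = {}
    for value in (0, 1):
        subset = [e for e in examples if e[attr] == value]
        if subset:
            freq = len(subset) / total
            # Assumption: a branch is expanded only while its joint frequency
            # is not below the expansion threshold.
            child = (build_gt(subset, remaining, total, threshold)
                     if freq >= threshold else None)
            arcs[value] = (freq, child)
    return {"attr": attr, "arcs": arcs}

# Example set S of Example 2.1, one string of bits (a1 a2 a3) per example
raw = ["101", "011", "101", "111", "101", "011", "111", "101", "111", "101"]
S = [dict(zip(ATTRS, map(int, bits))) for bits in raw]
gt = build_gt(S, ATTRS, total=len(S))
print(gt["attr"])                # attribute chosen at the root
print(gt["arcs"][1][1]["attr"])  # attribute chosen at the next node
```

With the expansion threshold set to 0, the sketch selects $a_3$ at the root and $a_1$ at the next node, in agreement with the cardinalities listed in the example.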

Theorem 2.1 The Generation Tree constructed by the algorithm described above
converges in the limit.

Proof: By the Glivenko-Cantelli Theorem, $F_n(x) - F(x) \to 0$ with probability 1,
uniformly in $x$, where $F_n$ is the empirical distribution function and $F$ the
true distribution. This implies that for a given $\epsilon > 0$,
$\exists\, n(\epsilon)$ such that $|F_n(x) - F(x)| < \epsilon$, with probability 1,
$\forall n > n(\epsilon)$.

If we assume that the joint frequencies are such that in the "limiting
description" there are no ties in the selection of attributes at all the nodes
of the Generation Tree, we can choose $\epsilon$ such that for all
$n > n(\epsilon)$, chosen suitably, the structure of the Generation Tree
stabilises. Once the structure stabilises the frequencies also stabilise.
Furthermore, by the Central Limit Theorem, this convergence is at the rate of
$\sqrt{n}$.

2.3  Class  Description  from  Generation  Tree 

The class description is represented as a binary tree where nodes are labelled
by attributes and arcs by the corresponding attribute's value and joint
frequency. The attribute at a node is selected using the maximal representation
learning criterion explained in section 2.2. The Generation Tree is a
disjunction of sub-descriptions. Each sub-description corresponds to a path in
the GT, and is a conjunction of the (attribute, value) pairs of that path. The
sequence of the (attribute, value) pairs in the sub-description specifies that
each such pair is conditioned on the previous pairs. It is this sequence which
helps determine the importance of attributes. In Example 2.1 there are three
sub-descriptions, viz.
{$(a_3 = 1)$ & $(a_1 = 1)$ & $(a_2 = 1)$ 0.3},
{$(a_3 = 1)$ & $(a_1 = 1)$ & $(a_2 = 0)$ 0.5},
{$(a_3 = 1)$ & $(a_1 = 0)$ & $(a_2 = 1)$ 0.2}.
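Reading sub-descriptions off a Generation Tree amounts to enumerating its root-to-leaf paths. The following sketch assumes a hypothetical nested-dictionary encoding of the GT (each node holds an attribute name and a map from value to joint frequency and subtree); the tree literal is the GT of Example 2.1 written out by hand.

```python
def sub_descriptions(node, prefix=()):
    """Yield (pairs, leaf_frequency) for every root-to-leaf path of a GT."""
    for value, (freq, child) in sorted(node["arcs"].items(), reverse=True):
        pairs = prefix + ((node["attr"], value),)
        if child is None:
            yield pairs, freq            # leaf: the conjunction is complete
        else:
            yield from sub_descriptions(child, pairs)

# GT of Example 2.1; arcs map a value to (joint frequency, subtree or None)
gt = {"attr": "a3", "arcs": {
    1: (1.0, {"attr": "a1", "arcs": {
        1: (0.8, {"attr": "a2", "arcs": {1: (0.3, None), 0: (0.5, None)}}),
        0: (0.2, {"attr": "a2", "arcs": {1: (0.2, None)}}),
    }}),
}}

paths = list(sub_descriptions(gt))
for pairs, freq in paths:
    print(" & ".join(f"({a} = {v})" for a, v in pairs), freq)
```

Each printed line is one conjunction of (attribute, value) pairs followed by its leaf frequency; the class description is the disjunction of these lines.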

The binary tree structure for class description is found to be powerful enough
for practical applications. It also helps in evaluation of importance of
attributes for a class, and aids in determining redundancy of attributes in a
simple manner.

3  Redundancy  of  an  attribute  at  a 
node  of  a  GT 

The aim of finding if an attribute is redundant at a node is to simplify the
class description by pruning the Generation Tree. It may also lead to finding
redundant attributes for a class. A simple class description helps in easier
explanation, and the collection of less information from the user. The latter
could mean saving in terms of time and money for the user if the attribute is a
complex test in a diagnostic context.

In the GT framework, it is very easy to determine redundancy of an attribute at
a node. The GT is constructed top-down, whereas redundancy is determined
bottom-up. The reason for this is as follows: redundancy can be determined at
any time only at the bottom-most node of a path because, at any other node, the
correlation of subsequent attributes is not known. Since the attribute at the
bottom-most node is conditioned on all the earlier attributes, no more
information is required and the redundancy check can be applied at this node.

Definition 3.1 An attribute is said to be redundant at a node if the node is the
lowest in the path and both the domain values occur with equal frequency.

Note: The equality is in the ideal case. In practice, however, if the percentage
difference in frequency is less than a specified threshold, then that node can
be deleted.

Definition 3.2 An attribute is redundant for a class if it is redundant at all
the nodes where it is chosen, i.e., it does not occur at any node of the GT
after the redundancy check is applied.

Definition 3.3 A class is said to be a null class if each element of the
cartesian product of the domain of attributes occurs with equal frequency. In
the GT framework such a class should be represented by a single node.

Note:  The  definition  of  redundancy  helps  reduce  the 
GT  in  such  a  case  to  a  single  node.  This  is  illustrated 
by  the  following  example. 

Example 3.1 In this example we show that the given set of samples
S = {(1 1) (1 0) (0 1) (0 0)} represents the null class. This also illustrates
redundancy at a node and attribute redundancy for a class. The reduction of the
initial GT to a single node is described pictorially in figure 2.
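The bottom-up redundancy check of Definition 3.1 can be sketched as a recursive pruning pass. This is a hedged illustration only: the nested-dictionary tree encoding, the function name `prune`, the tolerance parameter `tol`, and the convention of returning `None` for a deleted node are assumptions introduced here. The tree literal is the initial GT for the sample set of Example 3.1.

```python
def prune(node, tol=0.0):
    """Prune a GT bottom-up; return None when the node itself is redundant."""
    if node is None:
        return None
    # Prune the children first, so redundancy is always judged at the lowest node
    arcs = {v: (f, prune(c, tol)) for v, (f, c) in node["arcs"].items()}
    is_lowest = all(child is None for _, child in arcs.values())
    if is_lowest and len(arcs) == 2:
        f0, f1 = arcs[0][0], arcs[1][0]
        # Assumed reading of the practical rule: relative difference within tol
        if abs(f0 - f1) <= tol * max(f0, f1):
            return None
    return {"attr": node["attr"], "arcs": arcs}

def a2_node():
    # a2 splits evenly below either value of a1 in Example 3.1
    return {"attr": "a2", "arcs": {0: (0.25, None), 1: (0.25, None)}}

null_class_gt = {"attr": "a1", "arcs": {0: (0.5, a2_node()), 1: (0.5, a2_node())}}
print(prune(null_class_gt))  # None: no attribute test survives

skewed_gt = {"attr": "a1", "arcs": {0: (0.7, None), 1: (0.3, None)}}
print(prune(skewed_gt))      # unchanged: the two values are not equally frequent
```

Here a `None` result stands for the single unlabelled node to which the null class collapses: both $a_2$ nodes are removed first, which makes the $a_1$ node the lowest in its path and therefore removable in turn.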

Note: The redundancy of an attribute for decision tree and clustering systems is
based on the discriminative power of the attribute whereas in KAHLE, an
attribute is redundant because it does not add to the description of the class.

4  Importance  of  an  attribute 

The Importance of an attribute is the relevance of the attribute for a
particular class. It is denoted as $I(a_r \mid C_x)$, read as Importance of an
attribute $a_r$ given class $C_x$. The importance of attributes for a class not
only aids in inference under uncertainty but also helps prune the GT by enabling
dropping of attributes.

Since  the  Importance  of  an  attribute  is  judged  by 
the  representativeness  of  the  attribute  in  the  class,  we 
consider  the  Importance  of  an  attribute  given  the  class 
as  directly  proportional  to  the  frequency  at  the  node 
at  which  it  occurs,  and  inversely  proportional  to  the 
distance  of  the  node  from  the  root. 

The importance lies between 0 and 1, and satisfies the following property:

$$\sum_{r=1}^{n} I(a_r \mid C_x) = 1 \qquad (1)$$

As a first approximation we have chosen the following form for
$I(a_r \mid C_x)$ which satisfies the above constraints.

$$I(a_r \mid C_x) = \sum_{p=1}^{t_r} \frac{2}{n(n+1)}\,(n - d_{rp})\, f_{rp} \qquad (2)$$

where

$n$ is the distance from leaf to the root node (i.e. equal to the number of
attributes considered),

$2/(n(n+1))$ is the normalising factor,

$t_r$ is the number of occurrences of attribute $a_r$ in the GT,

$d_{rp}$ is the distance of the $p$th occurrence of attribute $a_r$ from the
root, and

$f_{rp}$ is the joint frequency at this node.
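Equation (2) can be evaluated directly once the occurrences of an attribute in the GT are listed. The sketch below is illustrative only: the function name and the encoding of occurrences as (distance from root, joint frequency) pairs are assumptions. The values for $a_1$ and $a_2$ are those used for class $C_1$ in Example 5.1; the 0.9/0.1 split assumed for $a_3$ is hypothetical.

```python
def importance(occurrences, n):
    """Evaluate equation (2) for one attribute.

    occurrences: list of (d_rp, f_rp) pairs, one per occurrence of the attribute
    n: number of attributes considered (depth of the GT)
    """
    norm = 2.0 / (n * (n + 1))  # normalising factor 2 / (n (n + 1))
    return sum(norm * (n - d) * f for d, f in occurrences)

n = 3
i_a1 = importance([(0, 1.0)], n)            # a1 at the root
i_a2 = importance([(1, 0.9), (1, 0.1)], n)  # a2 one level below the root
i_a3 = importance([(2, 0.9), (2, 0.1)], n)  # a3 two levels below (assumed split)
print(round(i_a1, 2), round(i_a2, 2), round(i_a3, 2))
print(round(i_a1 + i_a2 + i_a3, 6))         # property (1): the importances sum to 1
```

When every path has full length, the frequencies at each depth sum to 1, so the importances depend only on the depth at which each attribute is selected.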

Note: In both decision tree and clustering systems the importance of an
attribute for a class is not explicitly evaluated. However, in decision tree
systems it is implicit that an attribute occurring higher in the tree is more
important for the application (i.e., for all the classes) than those occurring
lower down, because the entropy measure is used to select the attribute at each
node.


5  The Inference Process


The  inference  process  used  in  our  system  is  a  modified 
version  of  Category  Validity  [Med  87]  which  is  defined 
as  the  probability  of  the  attribute  value  set  given  the 
class.  The  attribute  value  set  corresponds  to  a  path 
in  the  GT.  The  modification  was  necessary  to  help 
classify  test  examples  having  missing  attribute  values. 
The  process  is  described  below: 

1.  Set $C_i$ to the first class $C_1$.

2.  Start with the root node of the GT corresponding to the class $C_i$.

3.  At each node of the GT traverse that arc whose attribute value is the same
as that of the input example $e$. If the value is different from all the arcs
existing at that node, then output validity as zero and stop. If the value is
missing, then traverse all the arcs at that node. Repeat this for all the
nodes. Then, evaluate $f(e \mid C_i)$, the Validity of class $C_i$, as shown
below:

$$f(e \mid C_i) = f(e_{p_1} \mid C_i) \sum_{l=1}^{n_1} I(a_l \mid C_i) + f(e_{p_2} \mid C_i) \sum_{l=1}^{n_2} I(a_l \mid C_i) + \cdots + f(e_{p_k} \mid C_i) \sum_{l=1}^{n_k} I(a_l \mid C_i) = \sum_{j=1}^{k} f(e_{p_j} \mid C_i) \sum_{l=1}^{n_j} I(a_l \mid C_i)$$

where

$f(e \mid C_i)$ is the Validity of class $C_i$ given example $e$,

$p_j$ is the $j$th path which the evidence matches either completely or
incompletely (incompletely if the example has missing attribute values),

$e_{p_j}$ is the set of (attribute, value) pairs along path $p_j$,

$f(e_{p_j} \mid C_i)$, the validity of class $C_i$ given $e_{p_j}$, is the leaf
frequency along path $p_j$, and

$n_j$ is the number of attributes along path $p_j$ for which a value is
available in $e$.

4.  If $C_i$ is the last class then go to the next step, else repeat steps 2 and
3 for the next class, viz., for $C_i := C_{i+1}$.

5.  Select that $C_i$ for which $f(e \mid C_i)$ is the maximum. If the maximum
value is equal to zero then output "Not Classified" and exit; else output $C_i$.
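The five steps above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: each GT is flattened into a list of (path, leaf frequency) pairs instead of being traversed node by node, a missing value is encoded as `None`, and all names are assumptions. The two paths of $C_1$ and of $C_2$ that match the first example, and the importances, use the values of Example 5.1; the remaining paths are hypothetical fillers, since figure 3 is not reproduced here.

```python
def validity(example, paths, imp):
    """Sum, over matching paths, leaf frequency times importance of known attributes."""
    total = 0.0
    for pairs, leaf_freq in paths:
        # A path matches when every attribute with a known value agrees with it
        if all(example.get(a) is None or example[a] == v for a, v in pairs):
            known = sum(imp[a] for a, _ in pairs if example.get(a) is not None)
            total += leaf_freq * known
    return total

def classify(example, classes):
    """Return the class of maximum validity, or "Not Classified" if all are zero."""
    scores = {name: validity(example, paths, imp)
              for name, (paths, imp) in classes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Not Classified"

def path(a1, a2, a3):
    return (("a1", a1), ("a2", a2), ("a3", a3))

imp = {"a1": 1 / 2, "a2": 1 / 3, "a3": 1 / 6}
classes = {
    "C1": ([(path(1, 0, 1), 0.6), (path(1, 0, 0), 0.3), (path(0, 1, 0), 0.1)], imp),
    "C2": ([(path(1, 0, 1), 0.1), (path(1, 0, 0), 0.7), (path(0, 0, 1), 0.2)], imp),
    "C3": ([(path(0, 0, 0), 1.0)], imp),
}

s1 = {"a1": 1, "a2": 0, "a3": None}  # a3 is missing
s2 = {"a1": 1, "a2": 1, "a3": 1}
print(classify(s1, classes), classify(s2, classes))
```

Working on flattened paths gives the same validity as traversing all arcs at a node with a missing value, because only paths consistent with the known values contribute to the sum.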

Example 5.1 Suppose we have three classes for which the GTs are as shown in
fig. 3. Suppose two examples to be classified are
$S_1 = \{(a_1 = 1)\ (a_2 = 0)\ (a_3 = ?)\}$ and
$S_2 = \{(a_1 = 1)\ (a_2 = 1)\ (a_3 = 1)\}$, where ? stands for a missing value.
Then,

$$f(e_{p_1} \mid C_1) = f((a_1 = 1)(a_2 = 0)(a_3 = 1) \mid C_1) = 0.6$$

$$f(e_{p_2} \mid C_1) = f((a_1 = 1)(a_2 = 0)(a_3 = 0) \mid C_1) = 0.3$$

$$I(a_1 \mid C_1) = \frac{2}{3(3+1)} \cdot (3 - 0) \cdot 1 = 0.5$$

$$I(a_2 \mid C_1) = \frac{2}{3(3+1)} \cdot ((3 - 1) \cdot 0.9 + (3 - 1) \cdot 0.1) = 0.33$$

Similarly the other importances can be calculated. The values are as follows:

$$I(a_1 \mid C_1) = I(a_1 \mid C_2) = I(a_1 \mid C_3) = 0.5$$

$$I(a_2 \mid C_1) = I(a_2 \mid C_2) = I(a_2 \mid C_3) = 0.33$$

$$I(a_3 \mid C_1) = I(a_3 \mid C_2) = I(a_3 \mid C_3) = 0.17$$

$$f(S_1 \mid C_1) = f(e_{p_1} \mid C_1)\,(I(a_1 \mid C_1) + I(a_2 \mid C_1)) + f(e_{p_2} \mid C_1)\,(I(a_1 \mid C_1) + I(a_2 \mid C_1)) = 0.6 \cdot (0.5 + 0.33) + 0.3 \cdot (0.5 + 0.33) = 0.75$$

$$f(S_1 \mid C_2) = f(e_{p_1} \mid C_2)\,(I(a_1 \mid C_2) + I(a_2 \mid C_2)) + f(e_{p_2} \mid C_2)\,(I(a_1 \mid C_2) + I(a_2 \mid C_2)) = 0.1 \cdot (0.5 + 0.33) + 0.7 \cdot (0.5 + 0.33) = 0.67$$

$$f(S_1 \mid C_3) = 0.0 \quad \text{[since } S_1 \text{ does not match any path of } C_3\text{]}$$

Therefore, $S_1$ is classified as $C_1$. $S_2$ cannot be classified since it
does not match (even incompletely) any path of any of the three GTs.


6  Practical  Applications 

The system has been implemented as a set of programs in Pascal on an MC68000
Unix machine. The inputs are learning examples, expansion threshold, sample size
of each class, attribute names, and the testing examples. For each class, the
frequency of "rightly classified", the frequency of "not classified", the
frequency of "wrongly classified" examples, and the number of sub-descriptions
are output.

Medicine has the best potential for application of learning systems as there is
tremendous variation in the example cases of each disease. Here, a class is a
disease and an example is a patient record. We have chosen a particular problem
of stomach diseases concerning classification of hiatal hernia and gallstones as
the first experiment. The second experiment, the characterisation of the two
political parties of the USA, viz. Democrat and Republican, helps substantiate
our model.

Application 1†: This is a two class example involving either hiatal hernia or
gallstones. Examples are specified by 11 binary valued attributes. The total
number of examples is 107, of which in the first experiment 54 were used for
learning and 53 for testing. In the second experiment these example sets were
interchanged. The expansion threshold in both the experiments is 0.20. The
performance details of the two experiments are given in tables 1 and 2
respectively.

The classification accuracy, defined as (freq. of rightly classified / freq. of
totally classified), is approx. 80% for the first experiment and 70% for the
second. These are quite good considering that the classes are fuzzy and
overlapping. The number of sub-descriptions learnt is much lower than the
number of distinct examples, as seen from tables 1 and 2. In both the
experiments, attributes $a_4$, aj and as have an importance value less than 0.04
(which is very low) for both the classes. Hence these attributes were deleted
and the experiment was carried out for both the databases. The number of
sub-descriptions for both the classes is lower than that of the earlier
experiment, and substantially lower than the number of distinct examples. The
Generation Tree learnt is thus compact. The performance details given in table 3
are comparable to the performance of the earlier experiment.

The performance after the deletion of the attributes $a_4$, aj, and as (since
these attributes once again had very low importance values) for the second
database is as shown in table 4. The number of sub-descriptions is once again
substantially lower than the number of distinct examples. In this case the
accuracy has improved from 70% to 79%, which substantiates the fact that having
more than the optimal number of attributes tends to degrade the performance
[Chan 75].

Application 2: The database is the 1984 Congressional voting pattern records
consisting of two classes, viz. Democrats and Republicans. There are sixteen
binary valued attributes for which each member has to vote. The total number of
examples is 435, of which 232 have all the attribute values and 203 have at
least one missing attribute value. So, the first set is used for learning and
the second set for testing. The value of the expansion threshold is 0.25. The
results of the study are as described in table 5.

The classification accuracy is equal to 94.5%. This compares favourably with
[Schl 87] where the accuracy reported is between 90% and 95%.
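The accuracy figure can be checked against the counts reported in table 5 using the definition given for Application 1 (rightly classified over totally classified). The sketch below assumes that "totally classified" means rightly plus wrongly classified examples, excluding the "not classified" ones; the function name is hypothetical.

```python
def classification_accuracy(rightly, wrongly):
    """Rightly classified over all classified (rightly + wrongly) examples."""
    classified = rightly + wrongly
    return rightly / classified if classified else 0.0

# Counts from table 5, summed over the two classes
rightly = 57 + 134
wrongly = 2 + 9
accuracy = classification_accuracy(rightly, wrongly)
print(f"{accuracy:.4f}")
```

Under this reading the ratio is 191/202, which agrees with the reported value of about 94.5%.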

Since the importance of attributes $a_2$ and $a_{10}$ was zero for both the
classes, and the importance of attributes $a_1$, $a_{11}$ and $a_{13}$ was very
low, they were dropped and the system was tested with the same expansion
threshold. The number of sub-descriptions for both the classes after deletion of
attributes is lower than for the above experiment and much lower than the sample
sizes of the two classes. The frequency details of the experiment given in
table 6 are exactly the same as in the previous experiment.

† This data has been taken from "Feature selection for binary data - medical
diagnosis with fuzzy sets" by J.C. Bezdek in Proc. of AFIPS, 1976, pp. 1059-1062.

7  Conclusions 

We have described a new domain independent learning model for knowledge
acquisition from examples specified by binary valued attributes using the
maximal representation learning criterion. The main features of this model are
learning of a provably convergent characteristic description of a class, and
learning of importance and redundancy of attributes. The characteristic
description generated here is the simplest definition having all the information
of the learning examples. This description can potentially discriminate the
class from all other classes envisaged. The concept of importance and redundancy
of attributes helps in learning a compact class description without degradation
in performance.

The study demonstrates that our approach learns a compact characteristic
description of the class without sacrificing performance for two diverse
applications and is hence encouraging. In the second experiment, even though all
the test examples had at least one missing attribute value, the performance is
good.

We have restricted ourselves to binary valued attributes in this paper. However,
this is not a limitation of the method. The learning technique discussed in this
paper can be used for non-binary (nominal and real) valued attributes with an
m-way branching at each node instead of a two-way branching. All the definitions
for binary valued attributes are easily extendable to this case. The
implementation of the system for these attributes is complete, and empirical
validation is in progress.
Figure 1. The Generation Tree

Figure 3. GTs For The Three Classes




Table 1: Frequency details for the first database

Class   freq. of learning   number of   freq. of rightly   freq. of not   freq. of wrongly
        examples            sub-des     classd.            classd.        classd.
1       [illegible]         11          [illegible]        0              2
2       25                  11          16                 0              [missing]

Table 2: Frequency details for the second database

Class   freq. of learning   number of   freq. of rightly   freq. of not   freq. of wrongly
        examples            sub-des     classd.            classd.        classd.
1       28                  9           22                 1              6
2       25                  9           15                 1              9

Table 3: Frequency details after deletion of attributes for first database

Class   freq. of learning   number of   freq. of rightly   freq. of not   freq. of wrongly
        examples            sub-des     classd.            classd.        classd.
1       29                  5           [missing]          1              2
2       25                  7           [illegible]        1              10

Table 4: Frequency details after deletion of attributes for second database

Class   freq. of learning   number of   freq. of rightly   freq. of not   freq. of wrongly
        examples            sub-des     classd.            classd.        classd.
1       28                  5           26                 1              2
2       25                  7           14                 1              10

Table 5: Frequency details for the second experiment

Class   freq. of learning   number of   freq. of rightly   freq. of not   freq. of wrongly
        examples            sub-des     classd.            classd.        classd.
1       108                 15          57                 1              2
2       124                 14          134                0              9

Table 6: Frequency details after attributes are deleted

Class   freq. of learning   number of   freq. of rightly   freq. of not   freq. of wrongly
        examples            sub-des     classd.            classd.        classd.
1       108                 11          57                 1              2
2       124                 12          134                0              [missing]
Abstract*


Fundamentally, generalization can be seen as a technique for performing
compression of information; the result of this process provides the basis to
learn new knowledge. However, in all domains, the use of an elementary process
requires the implementation of both fast and efficient algorithms. In this
paper, we present a new generalization mechanism, inspired by the structural
matching algorithm. The principle of our method consists in using the domain
theory in order to perform the saturation of the examples given by an expert.
We will see that this "rough" approach provides some advantages both for
generalization quality and for construction rapidity.


1  Introduction 

At the present time, a great number of generalization systems have been written
and are in working order. So, we can question the necessity of adding a new one
to this list. In fact, KBG corresponds to an extension of some existing and
running systems such as AGAPE [Bollinger 86] and OGUST [Vrain 90], both based on
the Structural Matching algorithm.

The system KBG is a generalizing tool belonging to the family of "Inductive
Learning" and more accurately to the one of "Constructive Learning" [Kodratoff,
Michalski 90]. It can learn a recognition function of a concept from an example
set which illustrates this concept. The system has been developed in the frame
of the PERSFICACE project [Cannat 88] whose aim was to apply techniques of
Machine Learning to the domain of Air Traffic Control (ATC) in order to build an
Expert Knowledge Base in Radar Conflict Resolution. Previous work, directed by
the CENA ("Centre d'Etudes de la Navigation Aérienne", Athis-Mons, France), has
pointed out the difficulties in modeling the tasks of air traffic controllers.

In this real-world domain there were several problems to solve, such as: the
presence of both symbolic and numeric descriptors; the necessity to express the
data in first order logic because each entity can exist with a varying number of
occurrences; the presence of complex background knowledge containing some
calculations between the descriptors. Lastly, the great size of the learning set
(examples and domain theory) required that any system perform generalization
rapidly. In order to implement a system able to satisfy all these constraints
simultaneously, we needed to greatly modify the existing algorithms and we
developed a new generalization method mainly based on the prior saturation of
examples.

This  paper  is  organized  in  two  parts.  In  the  first  one, 
we  detail  our  motivations  and  the  advantages  of  our 
approach.  In  the  second  one,  we  present  and  explain  more 
accurately  the  running  of  the  system. 

We must emphasize that KBG is a generalization tool but not a discrimination
one: indeed, the aim of this system is to efficiently obtain a most specific
generalization from an example set (without counter-examples), but not to
automatically build a set of discrimination rules between the different
concepts. Thereby, KBG is currently "just" a tool allowing the expert (or
another part of the learning system) to build up more easily the rules modeling
the application domain.

2  Structural  Matching 

2.1  Definitions 

We will begin by briefly recalling the definitions of
generalization and of Structural Matching [Kodratoff 86]
as used in both AGAPE and OGUST. The main goal of
the structural matching method is to limit the
information loss during the generalization process by
minimizing the use of the dropping rule [Michalski 83].

Def1 : We say that an expression G is a generalization
formula, modulo a domain theory T, of the
example set E1, E2, ..., EN, if there exist N
substitutions σi such that for each i : σi(G) is
included in Ei, modulo the domain theory T.

Def2 : We say that E1, E2, ..., EN structurally match
if there exist a formula F and N substitutions
σi such that for each i : σi(F) = Ei. The formula
F constitutes a correct generalization of the
examples. Putting the examples into Structural
Matching aims at finding F.

In  order  to  put  the  examples  into  Structural  Matching 
[Ganascia  87],  [Kodratoff  88],  we  first  use  the  theorems 
and  taxonomies  given  in  the  background  knowledge  as 
well  as  the  commutativity,  associativity  and  idempotency 
properties  of  conjunction.  The  elementary  process  of  the 
Structural  Matching  algorithm  is  split  into  two 
successive  steps  which  are  repeated  until  there  are  no 
more  constants  within  the  examples : 

1) The system chooses a constant Ci in each
example and turns it into a generalization variable
GV. The chosen constants must be as "similar" as
possible: these choices are driven by some
heuristics. During the introduction of GV, the
corresponding instantiations are memorized in
links associated with each example.

2) Then, the terms containing the generalization
variable GV are matched, taking care not to
modify the previously introduced variables.

In the second stage, in order to satisfy the Structural
Matching definition, we apply the dropping rule to the
parts of the examples which have not been treated by the
previous algorithm. Finally, the generalization formula is
easily obtained by computing the intersection between
the links associated with each example and by keeping
the common terms of the examples.

2.2  Algorithm  Analysis 

In the previous algorithm, the information given by the
domain theory is used only as needed. Thereby,
at a given time, the set of facts handled is relatively
small. So, this method is theoretically usable with any
size of examples and domain theory. However, in
practice we find that the knowledge base is consulted often
and that some pieces of information are therefore inferred
twice or more. Indeed, the inferences are performed first
when the system is searching for the most similar
constants, and again when it actually puts the
terms in Structural Matching. All these computations
slow down the generalization process. Although this problem can
be solved by memorizing the deduced instances, there is a
more troublesome one concerning the choice of the theorems
in the background knowledge to apply. As a matter of
fact, for lack of meta-knowledge, the choice of suitable
rules is mainly driven by syntactical criteria. As we
can see in the following example, this solution is not
totally satisfying :

Background knowledge :

T1 : IF rect (X), blue (X,Y) THEN flag (X)

T2 : IF rect (X), red (X,Y) THEN flag (X)

T3 : IF square (X), blue (X,Y) THEN signal (X)

T4 : IF square (X), red (X,Y) THEN signal (X)

T5 : IF square (X) THEN shape (X)

T6 : IF rect (X) THEN shape (X)

Examples :

e1 : rect (a), red (a, hot), square (b), blue (b, light);

e2 : square (c), red (c, dark), rect (d), blue (d, roy);

By  using  simple  syntactical  criteria  to  choose  the 
rules,  such  as  the  relative  complexity  of  the  theorems 
fired  (number  of  premises  and  variables),  we  are  going  to 
match  the  constants  (a,  c)  and  (b,  d)  because  it  is  simpler 
to  match  the  colors  (red,  blue)  than  the  forms  (rect, 
square).  So,  we  obtain  the  following  generalization : 

G  =  shape  (X),  shape  (Y),  red  (X,  Z),  blue  (Y,T) 

But a more detailed examination of the examples and
of the domain theory suggests that the crossed matching
(a,d) and (c,b) is much better in terms of information
content. In this way, we obtain the generalization :

G' = rect (X), square (Y), red (Z, W1),
blue (T, W2), flag (X), signal (Y)

2.3 Why the Saturation Approach?

Here, our argument is that syntactical criteria can
lead to misinterpretations of the semantics of the
domain. Therefore, for lack of meta-knowledge which
would allow us to elaborate a coherent strategy for the
choice of rules, we think that ultimately the best
approach is to saturate the examples initially with
the domain theory provided by the expert. This solution
may seem very questionable, so we now discuss
its advantages :

There are three kinds of advantages. Firstly, the
deduction step is done only once, when the system
saturates the examples; thereby, the theorem prover used
can be both simple and efficient because it consists
only of a forward chaining engine. Secondly, with this
approach, when the system is searching for the "most
similar constants" it already knows all the properties
verified by the constants; thus, the constants are chosen
with full knowledge of the facts and the generalization
will be more relevant. Last but not least, as the
deductions are done once and for all, and no longer for
each matching step, the learning will be faster in spite of
the increase in data.
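
The saturation step described above can be sketched as a small forward chaining loop. This is only an illustration, not the KBG engine: the tuple encoding of facts, the convention that uppercase names are rule variables, and the function names are assumptions made for the sketch, and the conclusions of T3 and T4 are read as signal(X).

```python
# A fact or pattern is a tuple: (predicate, arg1, arg2, ...).
# Assumption of this sketch: uppercase strings are rule variables.
RULES = [
    ([("rect", "X"), ("blue", "X", "Y")], ("flag", "X")),      # T1
    ([("rect", "X"), ("red", "X", "Y")], ("flag", "X")),       # T2
    ([("square", "X"), ("blue", "X", "Y")], ("signal", "X")),  # T3
    ([("square", "X"), ("red", "X", "Y")], ("signal", "X")),   # T4
    ([("square", "X")], ("shape", "X")),                       # T5
    ([("rect", "X")], ("shape", "X")),                         # T6
]


def is_var(term):
    return isinstance(term, str) and term.isupper()


def match(pattern, fact, binding):
    """Extend binding so that pattern equals fact, or return None."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    binding = dict(binding)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if binding.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return binding


def all_bindings(premises, facts, binding=None):
    """Yield every variable binding that satisfies all premises."""
    binding = {} if binding is None else binding
    if not premises:
        yield binding
        return
    for fact in facts:
        extended = match(premises[0], fact, binding)
        if extended is not None:
            yield from all_bindings(premises[1:], facts, extended)


def saturate(facts, rules):
    """Apply the rules until no new fact can be deduced."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # list(...) exhausts the generator before the set is modified
            for b in list(all_bindings(premises, facts)):
                new = (conclusion[0],) + tuple(b.get(t, t) for t in conclusion[1:])
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts


e1 = [("rect", "a"), ("red", "a", "hot"), ("square", "b"), ("blue", "b", "light")]
e2 = [("square", "c"), ("red", "c", "dark"), ("rect", "d"), ("blue", "d", "roy")]
sat1, sat2 = saturate(e1, RULES), saturate(e2, RULES)
```

Once both examples are saturated, flag holds for a and d and signal holds for b and c, which is the information that makes the crossed matching (a,d), (c,b) visible before any constant is chosen.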

Of course, the main drawback of this approach is the
possibility of running into combinatorial explosion.
However, several limiting factors exist. On one hand, the
facts are handled at the example level: therefore, if there
are E examples and N facts (terms) on average, we must
execute E propagations of N facts instead of one
propagation of ExN facts. On the other hand, the example
descriptions are usually composed of fewer than a
hundred facts; moreover, in similarity based
learning the domain theories are relatively small. Note
that the use of recursive rules in the background
knowledge is not a problem as long as they do not
generate an infinite list of instances.

In practice, during the PERSPICACE project, we
used without problem a domain theory containing 250
complex rules (with recursions) and 20 examples of 50
terms each [Cannat 88].

3  The  Algorithm 

3.1  Introduction 

The generalization process is performed by the system
in two steps. Firstly, using the background knowledge
given by the user, the system completes the examples
with the help of a forward chaining engine (saturation).
Secondly, the learning process begins, working on the
fact bases, one for each example, built with the help of
the inference engine. The generalization formula will be
composed of only the predicates which have at least one
instance in each fact base (the fact bases being the
expression of the saturated examples). This set of
predicates is named GP. The
generalization is performed predicate by predicate. For
each predicate, we gather the "most similar" instances
together, in order to turn them into variables. Roughly,
the algorithm is the following :


* For each predicate P belonging to the GP set
* While at least one ungeneralized instance remains

- Match the most similar instances of P.

- Name the generalization variables Vi and
create the links between them.

- Append the new predicate P(V1, ...) to the
generalization formula.

Firstly, we have to define the similarity computation
between the instances of a given predicate. In order to
execute this computation as fast as possible, we need to
index the information used; to achieve that, we build, for
each of the constants composing the examples, a list
gathering the properties it typically verifies. This list will
be called the "signature of the constant".

3.2  Signature  Between  Constants 

The "signature of the constant", noted SIG, is the
basis of the similarity computation. It expresses an
exhaustive summary of the properties verified by the
constant in the fact bases. Since we saturate the examples
at the beginning of the learning session, the signature
need only be computed once, at the end of the saturation
step. In KBG, we can split the constants into two types
that we call "objects" and "values" (notice that the
previous algorithms did not handle the values). The
objects are constants without semantics: in other words,
their significance is only provided by the predicates in
which they occur. The values are typed constants and
carry their own meaning. Here is an example :

Consider the type "size", which takes the values :

(small, medium, large, giga)

and the expression :

E = square(a) & outsize(a, small) & weight(a, 50)

According to our previous definition, the constant "a"
is an object and the constants "small" and "50" are
values. We can easily see the different semantics
associated with these two kinds of constants: the objects
express links between the properties (predicates) of the
same entity, whereas the values quantify those properties.
Consequently, the signatures of objects and values are
different. In the case of an object, its signature is
composed of the list of the occurrences (predicates,
positions) of this constant within the example :

E = square(a) & outsize(a, small) & weight(a, 50)
SIG (a) = (square1 outsize1 weight1)

The values are completely local to the predicates in
which they occur; thereby, the signature of a value is
the value itself. For instance : SIG(50) = 50.
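
As an illustration, the signature of the objects of an example can be computed in one pass over its facts. The sketch below assumes facts are tuples and that the set of declared values is known in advance (in KBG this would come from the type declarations); the function name and data layout are hypothetical.

```python
from collections import defaultdict


def signatures(facts, values):
    """Map each object to the sorted list of its (predicate, position) occurrences."""
    sig = defaultdict(set)
    for pred, *args in facts:
        for pos, arg in enumerate(args, start=1):
            if arg not in values:            # a value is its own signature
                sig[arg].add(f"{pred}{pos}")
    return {obj: sorted(occ) for obj, occ in sig.items()}


E = [("square", "a"), ("outsize", "a", "small"), ("weight", "a", 50)]
VALUES = {"small", "medium", "large", "giga", 50}   # assumed type declarations
SIG = signatures(E, VALUES)
```

Here `SIG["a"]` contains square1, outsize1 and weight1 (in sorted order), matching SIG(a) in the text.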

*  Redundant  instances : 

However, the completion process can sometimes have
drawbacks. Indeed, the same information can be
deduced several times. Consider this problem :

E1 = color (A) & square (A)

E2 = color (B) & square (C)

We also know the two rules :

R1 : square (x) => rectangle (x)

R2 : rectangle (x) => shape (x)

After the completion of the examples, we obtain the
following signatures :

SIG(A) = (color square rectangle shape)

SIG(B) = (color)

SIG(C) = (square rectangle shape)

In this case, for the examples E1 and E2, the
occurrences "rectangle" and "shape" are not relevant
because they provide no information beyond
"square". This is an annoying problem for two reasons :
firstly, the signature comparison, and consequently the
generalization, will be distorted; secondly, we will have
an unnecessarily large number of instances to treat.
Therefore, we need to delete this kind of information once
the completion step is finished. This operation is
easily performed with the following algorithm, whose
complexity is linear in the number of
instances in the fact base :

* For each predicate P having at least one instance :

If there is in the background knowledge a theorem
T whose premises are built from the predicate list
(Q1...Qn), named LP, such that :

1) All the instances of P have been deduced by
the theorem T (at least)

2) Let LPI be the list of predicates of LP whose
arguments are identical to those of P.
The instances of P must be identical to those of
one of the predicates of LPI (<=> same
number of instances).

Then the instances of P are not relevant, thus :

* If P belongs to GP then P is not generalized.

* P does not appear in the constant signatures.
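
The test above can be sketched as follows for one fact base. This is one reading of the two conditions, not the KBG code: it assumes the saturation step recorded, for every deduced fact, the names of the rules that produced it (`derived_by`), and it compares argument lists of premises and conclusion literally. It is also written for readability rather than for the linear complexity mentioned in the text.

```python
def redundant_predicates(instances, derived_by, rules):
    """Return the predicates whose instances add nothing to one of their premises.

    instances:  {predicate: set of argument tuples} for one fact base
    derived_by: {(predicate, args): set of rule names that deduced the fact}
    rules:      {name: (list of premise patterns, conclusion pattern)}
    """
    redundant = set()
    for pred, insts in instances.items():
        for name, (premises, conclusion) in rules.items():
            if conclusion[0] != pred:
                continue
            # Condition 1: every instance of pred was deduced by this rule.
            if not all(name in derived_by.get((pred, args), ()) for args in insts):
                continue
            # Condition 2: a premise with the same argument list (LPI)
            # has exactly the same instances as pred.
            lpi = [p[0] for p in premises if p[1:] == conclusion[1:]]
            if any(instances.get(q) == insts for q in lpi):
                redundant.add(pred)
                break
    return redundant


# Saturated fact base of E1 = color(A) & square(A)
instances = {
    "color": {("A",)}, "square": {("A",)},
    "rectangle": {("A",)}, "shape": {("A",)},
}
derived_by = {("rectangle", ("A",)): {"R1"}, ("shape", ("A",)): {"R2"}}
rules = {
    "R1": ([("square", "x")], ("rectangle", "x")),
    "R2": ([("rectangle", "x")], ("shape", "x")),
}
dropped = redundant_predicates(instances, derived_by, rules)
```

On this fact base both rectangle and shape are reported as redundant, so only color and square would remain in SIG(A).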

3.3  Similarity  Between  Constants 

In first order logic, the examples can be represented as
graphs where the nodes are the constants of the examples
and the edges the predicates (or properties) linking these
constants. So, in this representation, the generalization
problem becomes one of finding the "greatest" subgraph
common to all the examples: unfortunately, this is a
computationally hard problem! Thus, to match these
constants correctly and rapidly, we consider that the generalization process
consists of gathering, under the same variable name,
the most similar constants. In order to achieve this, we
are going to define a notion of similarity (distance)
between constants. This measure is noted SIM and its
computation depends on the constant type.

*  Similarity  between  objects  : 

In the case of objects, we just look at the common
properties between the two signatures, for instance, by
measuring the length of the lists' intersection. However,
it is necessary to balance this measure with the number
of differences observed. Otherwise the constants having a
lot of properties would always be matched together. In practice, we
use the following formula :

SIM (X,Y) = LENGTH (SIG(X) ∩ SIG(Y)) / LENGTH (SIG(X) ∪ SIG(Y))


We must emphasize that by sorting the signature lists
before computing the similarity, the complexity of this
operation becomes linear in the number of
items. Here is a computation example :

E1 = square(a) & red(a) & small(a)

E2 = rect(b) & red(b) & square(c) & green(c) & small(c)

SIG(a) = (square small red);

SIG(b) = (rect red);

SIG(c) = (square small green)

SIM(a,b) = 1/3 ; SIM(a,c) = 2/3
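
A short sketch of the object similarity follows. One caveat: evaluated literally with set union, the formula gives 1/4 and 1/2 on the example above, whereas the quoted values 1/3 and 2/3 are what one obtains when dividing by the length of the longer signature. The sketch therefore computes both variants; which normalization KBG actually used is not settled by the text, and both function names are hypothetical.

```python
from fractions import Fraction


def sim_union(sig_x, sig_y):
    """Common properties divided by the size of the union of both signatures."""
    x, y = set(sig_x), set(sig_y)
    if not x and not y:
        return Fraction(0)
    return Fraction(len(x & y), len(x | y))


def sim_longest(sig_x, sig_y):
    """Common properties divided by the length of the longer signature."""
    x, y = set(sig_x), set(sig_y)
    if not x and not y:
        return Fraction(0)
    return Fraction(len(x & y), max(len(x), len(y)))


SIG = {
    "a": ["square", "small", "red"],
    "b": ["rect", "red"],
    "c": ["square", "small", "green"],
}
union_ab, union_ac = sim_union(SIG["a"], SIG["b"]), sim_union(SIG["a"], SIG["c"])
longest_ab, longest_ac = sim_longest(SIG["a"], SIG["b"]), sim_longest(SIG["a"], SIG["c"])
```

Under either normalization a is closer to c than to b, so the resulting matching is the same for this example. Python sets are hash based, so the sorted-merge trick mentioned in the text is not needed here to stay linear on average.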

*  Similarity  between  values : 

In the case of values, the similarity measure is
computed between the values held by the same
predicate, according to the constant type. For instance, if
the constants belong to an ordered type, we can
classically define the measure of similarity as :

Let the type "size" : (small, medium, large, giga)

SIM (a,b) = (Interval-length - Place-difference(a,b)) / Interval-length

SIM (small, medium) = 3/4 = 0.75

SIM (medium, giga) = 2/4 = 0.50
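
The measure for ordered types translates directly into code. In this sketch the interval length is taken to be the number of values of the type, which is what the two worked results imply; the function name is an assumption.

```python
from fractions import Fraction

SIZE = ("small", "medium", "large", "giga")   # ordered type from the text


def sim_ordered(a, b, ordered_type):
    """(Interval-length - Place-difference(a, b)) / Interval-length."""
    interval_length = len(ordered_type)
    place_difference = abs(ordered_type.index(a) - ordered_type.index(b))
    return Fraction(interval_length - place_difference, interval_length)


s1 = sim_ordered("small", "medium", SIZE)
s2 = sim_ordered("medium", "giga", SIZE)
```

With this convention two identical values have similarity 1, while the two extremes of the type still have similarity 1/4 rather than 0.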

3.4  Similarity  Between  Instances 

In practice, the predicates are n-ary and the
arguments composing their instances are objects and
values. Therefore, for each argument, the computation of
the similarity is performed according to the constant
type. For a predicate of arity N, the distance between two
instances I1 and I2 is defined as the average of the
similarities between each pair of arguments A1k and A2k.
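
The averaging can be sketched as below, with one similarity function per argument position. The signatures, the `size` instances and the helper names are hypothetical data chosen for the illustration; the object measure uses the intersection-over-union form.

```python
from fractions import Fraction


def instance_similarity(inst1, inst2, arg_sims):
    """Average of the per-argument similarities of two instances of one predicate."""
    if not (len(inst1) == len(inst2) == len(arg_sims)):
        raise ValueError("instances and similarity functions must share the same arity")
    total = sum(sim(a, b) for sim, a, b in zip(arg_sims, inst1, inst2))
    return total / len(arg_sims)


# Hypothetical signatures and ordered type for a predicate size(object, value)
SIG = {"a": {"rect1", "red1", "size1"}, "d": {"triangle1", "size1"}}
SIZE = ("micro", "small", "medium", "large", "giga")


def sim_object(x, y):
    return Fraction(len(SIG[x] & SIG[y]), len(SIG[x] | SIG[y]))


def sim_size(u, v):
    gap = abs(SIZE.index(u) - SIZE.index(v))
    return Fraction(len(SIZE) - gap, len(SIZE))


# size(a, large) compared with size(d, giga): (1/4 + 4/5) / 2
score = instance_similarity(("a", "large"), ("d", "giga"), [sim_object, sim_size])
```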

3.5  Matching  Between  Instances 

The matching of the most similar instances of a
predicate P is the basic operation which allows us to create a
new term in the generalization. It consists of selecting, for
each example Ei, one instance Ii of P in the fact base
associated with Ei, taking care to maintain as far as possible
the greatest similarity between the items of the resulting
vector (I1 ... IE). However, for a given predicate, we
cannot compute the optimum matching; indeed, if we have E
examples and an average of N instances for each one, the
complexity of this computation is exponential, in O(N^E).

So, we use the following heuristic to obtain a
reasonable complexity. The elementary step of the
process is to select, in the fact base associated with an
example, one instance I (not yet generalized); then, for
each remaining example, we choose the most similar
instance (generalized or not). Finally, we obtain a
generalizable instance set. The computation is finished
when all of the instances have been treated. The great advantage
of this method is firstly its simplicity and secondly a
reasonable complexity, in O(N^2.E). Another advantage of
this heuristic is that we can easily control the
application of idempotency (I => I&I) by modifying the
stopping test of the previous algorithm; if we do not
need idempotency, the system will match only the
instances not yet generalized. The main drawback is
that the quality of the matching depends a lot on the
choice of the first instance; thereby, this heuristic can be
very noise sensitive!
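
A minimal sketch of this heuristic is given below. It is not the KBG implementation: the order in which seeds are taken, the handling of a seed for which no partner is left, and the `idempotency` flag are assumptions of the sketch, and `sim` stands in for the instance similarity of Section 3.4.

```python
def greedy_matching(fact_bases, sim, idempotency=True):
    """Build vectors (I1 ... IE) with one instance of P per example.

    fact_bases: one list of instances of P per example
    sim:        similarity between two instances (higher means more similar)
    """
    generalized = [set() for _ in fact_bases]
    vectors = []
    for s, seed_base in enumerate(fact_bases):
        for seed in seed_base:
            if seed in generalized[s]:
                continue                       # seeds must not be generalized yet
            vector = []
            for k, base in enumerate(fact_bases):
                if k == s:
                    vector.append(seed)
                    continue
                # With idempotency an instance may be reused (I => I & I).
                pool = base if idempotency else [i for i in base
                                                 if i not in generalized[k]]
                if not pool:
                    break
                vector.append(max(pool, key=lambda i: sim(seed, i)))
            if len(vector) < len(fact_bases):
                continue                       # no complete vector for this seed
            for k, inst in enumerate(vector):
                generalized[k].add(inst)
            vectors.append(tuple(vector))
    return vectors


def toy_sim(i, j):
    # Stand-in similarity on toy numeric instances: closer numbers are more similar.
    return -abs(i[0] - j[0])


pairs = greedy_matching([[(1,), (10,)], [(9,), (2,)]], toy_sim)
reused = greedy_matching([[(1,), (10,)], [(2,)]], toy_sim)
strict = greedy_matching([[(1,), (10,)], [(2,)]], toy_sim, idempotency=False)
```

In the second call the single instance of the second example is matched twice, which is the effect of idempotency; in the third call it is matched only once. Because each vector is built around its seed alone, a poorly chosen seed fixes the whole vector, which is the noise sensitivity noted in the text.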

3.6  Links  Between  Variables  and  Constants 

The instance matching process provides us with a
vector V : (I1 ... IE), in which each instance is composed
of a list of N constants, N representing the arity of the
current predicate. This vector can be seen as a matrix of
constants M(N×E). In this way, creating a variable
corresponds to grouping, under the same symbol name,
one column of this matrix. For a given data type, the
different variables built during the generalization process
are not totally independent because their definitions often
have some common parts (list of substituted constants).
Therefore, independently of the variable type (object or
value) we must define some links between the variables,
specifying the coherence constraints :

* (X = Y) : both variables are always instantiated
with the same items.

* (X ≠ Y) : both variables are never instantiated with
the same items.

* (X ~ Y) : the variables can be either the same or
different (May Be the Same).

Traditionally, the MBS links are not explicitly given
in the generalization: as a matter of fact, in logic, when
two variables have different names, they can be instantiated
either with the same item or with different ones. On the other hand, when we compute a
generalization, we notice that the MBS links are less
numerous than the difference ones. Therefore, in order to
increase the readability of the result, we have chosen
another convention in our system: it is the difference
links which are implicit.

*  Exclusive  links : 

The MBS links are not always simultaneously true.
For example, we can write the links (X~Y) and (Y~Z)
without having, in the examples, the equality X=Y=Z;
in this case, we have lost some information. Therefore,
the relation between (X~Y) and (Y~Z) must be expressed
as a logical NAND rather than a simple AND. We
illustrate this point in the following example :

E1 = rect (a) & red (a) & rect (b) & large (b)

E2 = rect (c) & red (c) & rect (d) & red (d) & large (d)

E3 = rect (e) & red (e) & rect (f) & large (f)


G = rect(X) & rect(Z) & red(X) & red(Y) & large(Z)
& (X~Y) & (Y~Z)

with X = (a c e), Y = (a d e), Z = (b d f)

We need to be able to detect this kind of
information. Saying that two links L1 and L2 are never verified
at the same time means that the intersection between
the example sets where L1 is true and where L2 is true
is empty. The idea is then to memorize, for each MBS
link, the list of examples verifying this link, in the
form of a vector V of bits, in which each bit codes an
example; we will call this vector V the "truth table" of
the link. Thus, if the intersection between two tables
is empty, we are able to say that the corresponding links
are exclusive.

               ex1 ex2 ex3

L1 = (X~Y) = (  1   0   1  )

L2 = (Y~Z) = (  0   1   0  )

         AND  -------------

             (  0   0   0  )

=> The links L1 and L2 are exclusive.

Furthermore, this method allows us to establish some
other relations between the MBS links and therefore to
obtain a more specific generalization. Previously, we
have looked at the resolution of the logical equation
(Truth table 1 AND Truth table 2 = 0); however, it is
not the only interesting one. Let L1 and L2 be two links,
and T1 and T2 their associated truth tables :

* If (T1 AND T2 = 0)
we have the link : (L1 NAND L2)

* If (T1 AND T2 = T2), i.e. every example verifying L2 also verifies L1,
we have the link : (L2 => L1)

* If (T1 AND T2 = T1), i.e. every example verifying L1 also verifies L2,
we have the link : (L1 => L2)

* If (T1 = T2)
we have the link : (L1 <=> L2)

* If (T1 XOR T2 = 1)
we have the link : (L1 XOR L2)
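
The truth tables and the relations between two MBS links can be sketched with plain tuples of bits. In this sketch an implication is read as inclusion of truth tables: if every example that verifies one link also verifies the other, the first link implies the second. The function names and the string labels are assumptions made for the illustration.

```python
def truth_table(x, y):
    """Bit k is 1 when variables x and y are bound to the same constant in example k."""
    return tuple(int(a == b) for a, b in zip(x, y))


def relations(t1, t2):
    """Relations between two MBS links L1 and L2 given their truth tables."""
    both = tuple(a & b for a, b in zip(t1, t2))
    found = []
    if not any(both):
        found.append("L1 NAND L2")      # never verified together
    if t1 == t2:
        found.append("L1 <=> L2")
    elif both == t1:
        found.append("L1 => L2")        # examples of L1 are a subset of those of L2
    elif both == t2:
        found.append("L2 => L1")        # examples of L2 are a subset of those of L1
    if all(a ^ b for a, b in zip(t1, t2)):
        found.append("L1 XOR L2")       # exactly one link holds in every example
    return found


# Variable definitions from the example: one constant per example
X, Y, Z = ("a", "c", "e"), ("a", "d", "e"), ("b", "d", "f")
T1 = truth_table(X, Y)     # link L1 = (X~Y)
T2 = truth_table(Y, Z)     # link L2 = (Y~Z)
found = relations(T1, T2)
```

For the example, T1 is (1 0 1) and T2 is (0 1 0), so the two links are both exclusive (NAND) and complementary (XOR).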

The study of the relations between two MBS links
can be generalized to relations between several, in order
to bring out some more complex relations such
as (L1 & L2 => L3), etc. This problem has been
examined by [Ganascia 87]. Nevertheless, although this kind
of information is logically relevant, it is often of
little interest and its computation is
time consuming.

* Type-dependent links :

The previous links are defined for every kind of
variable. However, when the instances of the variables
are values, we can add some information to the
generalization formula in order for it to become more
specific. The functions which compute this kind of
information are type dependent. The usual links can be
split into two families: unary links and binary ones. The
unary links express a synthesis of the values handled by
each variable V. For example, if the instances of V
belong to an ordered type, we can add the interval of
values to the generalization. The binary links express the
observable relations between the values of two variables
V1 and V2 belonging to the same type. These links are
created by comparing the definitions of the variables.
For instance, in the case of integer values, we can add
some links such as "less than" or "greater than" between the
variables in the generalization. Here is an example :

E1 = age (a, 9) & teeth-nbr (a, 28)

E2 = age (b, 20) & teeth-nbr (b, 32)

G = age (X, A) & teeth-nbr (X, N) &
(A < N) & (A ∈ [9,20]) & (N ∈ [28,32])
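
For integer values, the unary and binary links of this example can be computed from the variable definitions alone. The sketch below assumes each variable is represented by the tuple of values it takes, one per example; the function names are hypothetical.

```python
import operator


def interval(values):
    """Unary link: the interval covered by the values of one variable."""
    return (min(values), max(values))


def order_link(v1, v2):
    """Binary link: the strongest order relation holding in every example, if any."""
    pairs = list(zip(v1, v2))
    for symbol, holds in (("<", operator.lt), (">", operator.gt),
                          ("<=", operator.le), (">=", operator.ge)):
        if all(holds(a, b) for a, b in pairs):
            return symbol
    return None


A = (9, 20)      # values of age in E1, E2
N = (28, 32)     # values of teeth-nbr in E1, E2
links = (interval(A), interval(N), order_link(A, N))
```

This yields the intervals [9,20] and [28,32] and the link A < N, as in the generalization G above.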


3.7 Results of Generalization and Example

In the KBG system, the examples are expressed in the
form of a conjunction of instantiated terms. The domain
theory is composed of two parts: the first is the
declaration of the data types used, and the second is the set of
production rules "IF ... THEN" expressing the relations
between the predicates. Moreover, the user can easily
define his own types of values by providing the
system with the LISP functions expressing the behavior of
these items; he can also put external procedural
calls into the premises and the conclusions of the rules
(for instance, to perform some computation).

In the current implementation, the result of the
generalization consists of six different parts :

1 - Generalization : A conjunction of predicates, the
arguments of which are variables or constants. These
predicates express the common properties and
relations found in the example set.

2 - Object matching : The generalization process is based
on the matching of the most similar constants (one
per example). Thereby, each variable of the
generalization formula corresponds to a list of
constants. In this part, we provide all these
corresponding substitutions.

3 - Intervals of values : In the case of typed constants, for
each corresponding variable we provide in this part the
intervals of possible values.

4 - Value links : In the case of typed constants in which
the elements are ordered, we can sometimes observe
some regularities such as "X<Y". This relation means that
in all the examples the value of X was always less
than the value of Y. In this part, we provide this
kind of dependency.

5 - Object links : Implicitly, in order to increase the
readability of the result, when two variables always
have a different substitution in the examples, their
names in the generalization are different. Here we
provide the MBS (May Be the Same) links, which
express that two variables can be either equal or not.

6 - MBS links : Here, we provide the relations between
the MBS links (Nand, Xor, ...) described above.

Here is a simple example of a generalization process :

size : ordered type (micro, small, medium, large, giga);

R1 : IF near (X,Y) THEN near (Y,X) ;
R2 : IF on (X,Y) THEN near (X,Y) ;

e1 : rect (a), red (a), size (a, large), circle (b),
on (a, b), size (b, small);

e2 : ellipse (c), red (c), size (c, micro), triangle (d),
on (d, c), size (d, giga);

1) Generalization of the examples : (e1 e2)

G = ellipse (C0), shape (C1), on (C1, C0), red (C2)
outsize (C0, TA0), outsize (C1, TA1).

2) Matching : (C0 : b c), (C1 : a d), (C2 : a c),
(TA0 : micro small), (TA1 : large giga).

3) Interval of values : (TA0 ∈ [micro .. small]),
(TA1 ∈ [large .. giga]).

4) Links between values : (TA1 >= TA0).

5) Object links (MBS) : (L1 : C2~C0),
(L2 : C2~C1).

6) MBS links : (L1 xor L2).

4 Conclusion

The generalization process can be seen as an operation
aiming to extract and to compact the meaningful
information held in an example set. The result of this
operation can be used by a higher conceptual level
mechanism or even by another external system in order to
build a knowledge base modeling the studied domain.

However, using generalization as an elementary
process requires that it be both accurate and fast. We think
that the completion strategy and the algorithms used
in our system allow us to achieve these goals. In the
PERSPICACE project, the KBG system was used just as
a tool for helping in the construction of the knowledge
base. Thus, a large part of the process was performed
manually. At the present time, we are developing two
new components, using the generalizer results, in order to
increase the automatic part of the rule construction.

On the one hand, currently when we generalize an
example set illustrating one concept, we obtain a large
recognition function in which not all the terms are
necessarily meaningful; so, the management of
counter-examples will allow us to perform an
"over-generalization" of the learnt knowledge. In this way, we
will obtain more compact and significant information.

On the other hand, the system's capacity to explain its
computation is fundamental: indeed, when the expert
does not agree, it is often very difficult to find, just by
reading the generalization, the modifications to the
knowledge representation that are required. We have seen
that the generalization process consists of gathering the
most similar objects. In our case, the completion of the
examples ensures that the matching will
roughly reflect all of the properties expressed in the
background knowledge. Thus, the idea is to translate into a
readable form the information provided by the
similarity computation and to use it to establish a
direct dialog between the learning system and the expert.
The implementation of explanations will also be the
basis on which to elaborate a conceptual clustering of
the examples, as a first step towards automatic
knowledge base generation.

Abstract

This paper presents a new method for
acquiring classificatory knowledge by
induction. Based on a probabilistic inference
technique, this method allows
inherent patterns in noisy training
instances to be easily detected. A set of
classification rules can then be constructed
based on these detected patterns. The
proposed method has been evaluated by
testing it with some real-world data sets
and the results show that it out-performs
some decision-tree based algorithms both
in terms of computational efficiency and
classification accuracy.

1.  Introduction 

Given a collection of objects (events, observations,
situations, processes, etc.) that are
described in terms of one or more attributes and
are preclassified into a number of known classes,
the classification problem is to find a set of
characteristic descriptions for these classes or,
equivalently, a procedure for identifying an
object as belonging to, or not belonging to, a
particular one.

Many inductive learning systems have been
developed to solve the classification problem and
the decision-tree based program ID3 is one of the
best known among them [10]. It has been
successfully employed for a wide range of
applications including several industrial projects [12]. In
order to further improve the performance of ID3
when classifications have to be made in the
presence of uncertainty, a pre-pruning method based
on the use of the chi-square test has been
employed [12-13]. The problem with the use of
this test is that the number of training instances
satisfying the path that is being constructed
becomes smaller and smaller. When the number falls
below a certain threshold, the chi-square test
cannot be validly used [12]. Therefore, such a
technique can be employed only when a large
training set is available.


The other problem with pre-pruning of decision trees is that the pruning
decision is made based on local information alone. Due to limited 'lookahead',
this process may terminate too early, leaving out important information. To
avoid this problem, post-pruning of already-formed decision trees has been
proposed. The current post-pruning strategy adopted by ID3 uses a set of
production rules equivalent to the tree [12-13]. Based on Fisher's Exact Test,
the conditions on the left-hand side of each rule are examined to determine if
they are relevant for classification. Conditions that are irrelevant are
discarded from a rule. The set of pruned rules is then further simplified by
throwing away those whose omission would not lead to more misclassified
instances. As observed in [12], this algorithm is rather slow and inefficient
when compared to other post-pruning strategies. Furthermore, since
hill-climbing strategies are used for pruning the conditions and rules,
important information may be lost due to a local optimum [13].

2. An Efficient Classification Method

To efficiently handle uncertainty in classification tasks, we have developed
an inductive learning method based on a powerful probabilistic inference
technique [1-2]. It can uncover patterns underlying a set of noisy data and
is, for this reason, particularly effective in dealing with uncertainty. The
method consists of three phases: 1) detection of underlying patterns in
training data; 2) construction of classification rules based on the detected
patterns; and 3) use of these rules for object classification.


2.1. Detection of Patterns in Noisy Data


Suppose a training set consists of $M$ objects described by $n$ attributes,
$Attr_1, \ldots, Attr_n$, so that in an instantiation of an object
description, $Attr_j$, $1 \le j \le n$, takes on a value
$val_j \in domain(Attr_j) = \{v_{jk} \mid k = 1, \ldots, J\}$. Suppose that
the $M$ objects are preclassified into $P$ known classes, $c_p$,
$p = 1, \ldots, P$; the classification problem is to find a set of
characteristic descriptions for these classes. To discover patterns underlying
this set of noisy training data, attributes important for object class
determination need to be identified. To achieve this, many systems use the
chi-square test to find attributes that are statistically dependent on the
classes [11].

Let $o_{pk}$ be the total number of objects in the training set that belong to
$c_p$ and are characterised by $v_{jk}$, and let $e_{pk}$ be the expected
number of such objects under the assumption that the values of $Attr_j$ are
randomly distributed in $c_p$. Then, the chi-square statistic can be defined
as:

$$X^2 = \sum_{p=1}^{P} \sum_{k=1}^{J} \frac{(o_{pk} - e_{pk})^2}{e_{pk}}, \qquad e_{pk} = \frac{\left(\sum_{i=1}^{J} o_{pi}\right)\left(\sum_{i=1}^{P} o_{ik}\right)}{M'} \tag{1}$$

where $M' = \sum_{p,k} o_{pk}$ is less than or equal to $M$ (the total number
of objects in the training set) due to the possibility of having missing
values in the data. If $X^2$ is greater than the critical value
$\chi^2_{d,\alpha}$, where $d = (P-1)(J-1)$ is the number of degrees of
freedom and $\alpha$, usually taken to be 0.05 or 0.01, is the significance
level (i.e. $(1-\alpha) \times 100\%$ is the confidence level), then one can
conclude that $Attr_j$ is statistically dependent on the classes and is
therefore helpful in determining the class membership of an object. Otherwise,
there is not enough evidence to support such a conclusion.
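
As a minimal sketch of the dependence test described above, the following Python function computes the chi-square statistic and its degrees of freedom from a P x J table of observed counts for one attribute. The function name `chi_square`, the list-of-lists layout, and the example counts are illustrative assumptions rather than part of the method as published; the sketch also assumes that no class and no attribute value has a zero total.

```python
def chi_square(table):
    """Return (X2, dof) for a P x J table of observed counts.

    table[p][k] is the number of training objects in class p that take
    the k-th value of the attribute (objects with a missing value are
    assumed to have been left out of the table already).
    """
    row_totals = [sum(row) for row in table]        # objects per class
    col_totals = [sum(col) for col in zip(*table)]  # objects per value
    m = sum(row_totals)                             # M' in the text
    x2 = 0.0
    for p, row in enumerate(table):
        for k, observed in enumerate(row):
            expected = row_totals[p] * col_totals[k] / m
            x2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return x2, dof


# Hypothetical counts: two classes, one attribute with two values.
observed = [[30, 10],
            [10, 30]]
x2, dof = chi_square(observed)
# 3.841 is the standard critical value for dof = 1 at the 0.05 level.
dependent = x2 > 3.841
```

With these made-up counts every expected frequency is 20, so the statistic is large and the attribute would be judged dependent on the classes.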

However, the chi-square test indicates only that an attribute is important for
classification but not which particular value, say $v_{jk}$, of $Attr_j$ is
helpful in determining if an object belongs to a class, say $c_p$. Such a
value can, however, be identified if
$\Pr(\text{object is in } c_p \mid Attr_j = v_{jk})$ is significantly
different from $\Pr(\text{object is in } c_p)$ (i.e. $o_{pk}$ should deviate
significantly from $e_{pk}$ [1-2]). Since the absolute difference
$|o_{pk} - e_{pk}|$ does not provide information on the relative degree of the
discrepancy, it needs to be standardized to avoid the influence of the
marginal totals. Haberman [7] recommended using the adjusted residual:

$$d_{pk} = \frac{o_{pk} - e_{pk}}{\sqrt{e_{pk}\left(1 - \frac{o_{p+}}{M'}\right)\left(1 - \frac{o_{+k}}{M'}\right)}} \tag{2}$$

with $o_{p+}$ being the total number of objects in the training set that are
in $c_p$ and $o_{+k}$ the total number of training objects having the
characteristic $v_{jk}$ [7].

For $p = 1, 2, \ldots, P$ and $k = 1, 2, \ldots, J$, if the absolute value of
an adjusted residual, say $d_{pk}$, is greater than 1.96, the two-sided 95%
critical value of the standard normal distribution (or the corresponding 99%
value or higher for a greater confidence level), we can conclude that the
discrepancy between $o_{pk}$ and $e_{pk}$ (i.e. between
$\Pr(\text{object is in } c_p \mid Attr_j = v_{jk})$ and
$\Pr(\text{object is in } c_p)$) is significant and therefore $v_{jk}$ is
important for the classification of the objects it characterizes. The sign of
$d_{pk}$ is also important. A $d_{pk} > +1.96$ indicates that $v_{jk}$ is a
relevant value of $c_p$, whereas a $d_{pk} < -1.96$ indicates that it is more
unlikely for an object characterised by $v_{jk}$ to be a member of $c_p$ than
of other classes.
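
The residual analysis above can be sketched in a few lines of Python. The sketch assumes the standard form of Haberman's adjusted residual (deviation from the expected count, divided by an estimate of its standard error); the names `adjusted_residuals` and `relevant_cells`, the table layout, and the example counts are hypothetical.

```python
import math


def adjusted_residuals(table):
    """Return a P x J table of adjusted residuals for observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    m = sum(row_totals)
    residuals = []
    for p, row in enumerate(table):
        out = []
        for k, observed in enumerate(row):
            expected = row_totals[p] * col_totals[k] / m
            # Variance factor that removes the influence of the marginals.
            variance = (1 - row_totals[p] / m) * (1 - col_totals[k] / m)
            out.append((observed - expected) / math.sqrt(expected * variance))
        residuals.append(out)
    return residuals


def relevant_cells(residuals, threshold=1.96):
    """Return (class index, value index, residual) for significant cells."""
    return [(p, k, d)
            for p, row in enumerate(residuals)
            for k, d in enumerate(row)
            if abs(d) > threshold]


# Hypothetical counts: two classes, one attribute with two values.
d = adjusted_residuals([[30, 10],
                        [10, 30]])
significant = relevant_cells(d)
```

A positive entry in `significant` marks a value that supports the class, and a negative entry marks a value that counts against it.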

Values  of  Attrj  showing  no  correlation  with 
any  class  yield  no  classificatory  information. 
These  values  are  considered  as  irrelevant  for 
learning.  Their  inclusion  may  cause  overfitting 
and  the  generation  of  misleading  classification 
rules.  Hence,  they  are  discarded  from  further 
analysis. 


2.2. Construction of Classification Rules

To use the detected relevant values for each object class, they are explicitly
represented in the classification rules. Suppose that $v_{jk}$ is a relevant
value of $c_p$; the following rule is constructed.

If $Attr_j$ of an object is $v_{jk}$, then that this object belongs to $c_p$
is with weight of evidence
$W(Class = c_p / Class \neq c_p \mid Attr_j = v_{jk})$,

where $W(Class = c_p / Class \neq c_p \mid Attr_j = v_{jk})$ measures the
amount of positive or negative evidence provided by $v_{jk}$ supporting or
refuting the classification into $c_p$ of an object that it characterizes.

The derivation of $W(Class = c_p / Class \neq c_p \mid Attr_j = v_{jk})$ is
based on an information-theoretic measure known as the mutual information. The
mutual information between $c_p$ and the relevant feature, $v_{jk}$, is
defined as [8, 16]:

$$I(Class = c_p : Attr_j = v_{jk}) = \log \frac{\Pr(Class = c_p \mid Attr_j = v_{jk})}{\Pr(Class = c_p)} \tag{3}$$

so that $I(Class = c_p : Attr_j = v_{jk})$ is positive if and only if
$\Pr(Class = c_p \mid Attr_j = v_{jk}) > \Pr(Class = c_p)$; otherwise it is
either negative or has a value 0. Based on this measure, the weight of
evidence provided by $v_{jk}$ in favor of an object being a member of $c_p$,
as opposed to its not being a member of the class, can be defined as follows
[8]:

$$\begin{aligned} W(Class = c_p / Class \neq c_p \mid Attr_j = v_{jk}) &= I(Class = c_p : Attr_j = v_{jk}) - I(Class \neq c_p : Attr_j = v_{jk}) \\ &= \log \frac{\Pr(Attr_j = v_{jk} \mid Class = c_p)}{\Pr(Attr_j = v_{jk} \mid Class \neq c_p)} \end{aligned} \tag{4}$$

In other words, $W$ may be interpreted as the difference in information about
an object being in $c_p$ compared to other classes given that it is
characterised by $v_{jk}$ [8]. $W$ is therefore, in its technical sense, the
log of the likelihood ratio. It should be noted that the likelihood ratio is
well known among Bayesian statisticians. It is perhaps for this reason that it
is also quite popular among AI researchers as a scheme for uncertainty
representation. For example, Duda et al. found its use quite convenient for
plausible reasoning in Prospector [5]. Schlimmer et al. also adopted it in the
STAGGER system [14].
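
To make the log-likelihood-ratio reading of the weight of evidence concrete, the sketch below estimates it from a table of observed counts. The conditional probabilities are estimated by simple relative frequencies, which is an assumption of this sketch, as are the base-2 logarithm, the function name `weight_of_evidence`, and the example counts. Cells with zero counts would need smoothing, which is not shown.

```python
import math


def weight_of_evidence(table, p, k):
    """Estimate W for class p and attribute value k from observed counts.

    table[p][k] counts training objects of class p with value k.
    W = log2( Pr(value k | class p) / Pr(value k | not class p) ).
    """
    in_class = sum(table[p])
    with_value = sum(row[k] for row in table)
    total = sum(sum(row) for row in table)
    pr_value_given_class = table[p][k] / in_class
    pr_value_given_other = (with_value - table[p][k]) / (total - in_class)
    return math.log2(pr_value_given_class / pr_value_given_other)


# Hypothetical counts: two classes, one attribute with two values.
counts = [[30, 10],
          [10, 30]]
w_for = weight_of_evidence(counts, 0, 0)      # value 0 supports class 0
w_against = weight_of_evidence(counts, 0, 1)  # value 1 counts against it
```

Here the first value is three times as likely inside class 0 as outside it, so its weight is positive, while the second value gives a weight of the same size with the opposite sign.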

The class description of a class $c_p$ can be considered as the subset of
classification rules whose conclusion parts predict $c_p$. These rules
describe each class of training objects probabilistically. It should be noted
that, like the decision-tree-based inductive learning systems, which determine
the information gain [10] of each attribute independently of the others at a
node, the proposed method also evaluates the weight of evidence provided by an
attribute value for classification independently. Hence, each classification
rule represents the amount of evidence provided by a single attribute value
for or against an object's being assigned to a certain class. Conjunctive
rules can, however, be formed by combining the first-order rules. For example,
consider the following rules:

1. If $Attr_i = v_i$ then $Class = c$ with $W = w_i$.

2. If $Attr_j = v_j$ then $Class = c$ with $W = w_j$.

These rules can be combined to form Rule No. 3:

3. If $Attr_i = v_i$ and $Attr_j = v_j$ then $Class = c$ with
$W = w_i + w_j$.

In other words, low-order rules can be combined to form high-order conjunctive
ones if the conclusion parts of the low-order rules are the same. If the
condition parts of two or more rules involve the same attribute, they can be
combined to form disjunctive rules. For example, consider the following rules:

4. If $Attr_i = v_{ik}$ then $Class = c$ with $W = w_k$.

5. If $Attr_i = v_{il}$ then $Class = c$ with $W = w_l$.

These rules can be combined to form Rule No. 6:

6. If $Attr_i = v_{ik}$ or $Attr_i = v_{il}$ then $Class = c$ with
$W = w_k$ or $w_l$.


Though not explicitly represented as high-order rules, the combination of the
low-order ones is necessary in order to determine the class membership of an
object (Section 2.3). As will be described later in Sections 3 and 4, based on
this rule combination technique, the proposed method can be shown to perform
better, both in terms of accuracy and learning efficiency, than common
classification techniques.

Other than combining low-order rules in the above manner, high-order rules can
also be formed by considering attribute values together, instead of
separately, in determining their relevancy for classification (Section 2.1).
The details of how this can be done will be discussed in a separate paper.

2.3. Classification of Objects

Suppose that an object obj described by $n$ characteristics
$val_1, \ldots, val_j, \ldots, val_n$ is given. By matching each attribute
value of obj against each classification rule in turn, the classes it may be
assigned to can be determined. However, since the description of obj may match
partially with that of more than one class of objects, it may be classified
into different classes based on its different characteristics. As discussed in
the last section, since each of the attribute values of obj that matches the
classification rules can be considered as providing some evidence for or
against the assignment of obj to those classes predicted by the rules, its
classification can be made based on a measure that combines these pieces of
evidence. The value of this measure should increase with the strength and the
number of pieces of positive evidence supporting a specific class assignment
for obj, and decrease with the strength and the number of pieces of negative
evidence. One measure that possesses this property is proposed here. It
estimates quantitatively the evidence provided by $val_1, \ldots, val_n$ in
favor of obj being classified into $c_{obj}$ as opposed to being classified
into other classes. It is defined as:

$$\begin{aligned} W(C_{obj} = c_{obj} / C_{obj} \neq c_{obj} \mid val_1, \ldots, val_n) &= \log \frac{\Pr(C_{obj} = c_{obj} \mid val_1, \ldots, val_n)}{\Pr(C_{obj} \neq c_{obj} \mid val_1, \ldots, val_n)} - \log \frac{\Pr(C_{obj} = c_{obj})}{\Pr(C_{obj} \neq c_{obj})} \\ &= \log \frac{\Pr(val_1, \ldots, val_n \mid C_{obj} = c_{obj})}{\Pr(val_1, \ldots, val_n \mid C_{obj} \neq c_{obj})} \end{aligned} \tag{5}$$

Suppose that, of the $n$ attributes that describe obj, only $m$ ($m \le n$) of
them, $val_{[1]}, \ldots, val_{[j]}, \ldots, val_{[m]} \in \{val_j \mid j = 1, \ldots, n\}$,
are found to match one or more classification rules; then the weight of
evidence can be simplified into:

$$W(C_{obj} = c_{obj} / C_{obj} \neq c_{obj} \mid val_1, \ldots, val_n) = W(C_{obj} = c_{obj} / C_{obj} \neq c_{obj} \mid val_{[1]}, \ldots, val_{[m]}) \tag{6}$$

If there is no a priori knowledge concerning the interrelation of the
attributes in the problem domain, then this weight provided by all the
attribute values of obj in favor of it being assigned to $c_{obj}$ as opposed
to being assigned to other classes is equal to the sum of the weights provided
by each individual attribute value of obj that is relevant for classifying it
to $c_{obj}$ [1]:

$$W(C_{obj} = c_{obj} / C_{obj} \neq c_{obj} \mid val_{[1]}, \ldots, val_{[m]}) = \sum_{j=1}^{m} W(C_{obj} = c_{obj} / C_{obj} \neq c_{obj} \mid val_{[j]}) \tag{7}$$

In brief, the classification strategy can be summarised as follows. Given an
object obj, the set of classification rules is searched to determine which
class descriptions are partially matched by that of obj. If a value satisfies
the condition part of a rule, the positive and negative evidence provided by
the attribute values of obj is quantitatively measured and combined for
comparison. obj is assigned to $c_p$ when it has the largest partial match,
that is:

$$W(C_{obj} = c_p / C_{obj} \neq c_p \mid val_1, \ldots, val_n) > W(C_{obj} = c_h / C_{obj} \neq c_h \mid val_1, \ldots, val_n), \quad h = 1, 2, \ldots, P' \text{ and } h \neq p \tag{8}$$

where $P'$ ($\le P$) denotes the number of classes that are partially matched
by the attribute values of obj.
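
The classification strategy summarised above can be sketched as a lookup followed by a sum and a comparison. The rule representation (a dictionary keyed by attribute and value), the function name `classify`, the fallback to a default class, and all weights in the example are illustrative assumptions; ties are broken arbitrarily here, although the text notes that several assignments may be equally plausible.

```python
def classify(obj, rules, default_class):
    """Assign obj to the class with the largest total weight of evidence.

    obj   : dict mapping attribute name -> value
    rules : dict mapping (attribute, value) -> {class: weight of evidence}
    """
    totals = {}
    for attribute, value in obj.items():
        # Only values that match the condition part of a rule contribute.
        for cls, weight in rules.get((attribute, value), {}).items():
            totals[cls] = totals.get(cls, 0.0) + weight
    if not totals:
        # No evidence for or against any class: fall back to the default.
        return default_class
    return max(totals, key=totals.get)


# Hypothetical first-order rules with made-up weights.
rules = {
    ("colour", "red"): {"A": 1.2, "B": -0.8},
    ("size", "big"): {"B": 0.5},
    ("shape", "round"): {"A": -0.3},
}
label = classify({"colour": "red", "size": "big", "shape": "round"},
                 rules, default_class="B")
unmatched = classify({"colour": "blue"}, rules, default_class="B")
```

In the example class A collects 1.2 - 0.3 and class B collects -0.8 + 0.5, so the object is assigned to A; the second object matches no rule and receives the default class.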

It should be noted that if two different plausible values have the same
greatest weight of evidence, there may be more than one plausible class
assignment for obj. Furthermore, if there is no evidence for or against any
specific class assignment, either classification may be withheld to avoid
furnishing an inaccurate assignment, or obj can be assigned to the class to
which the majority of training objects belong. If it happens that there is no
relevant value for determining the class membership of obj, then either the
training data is completely non-deterministic or there are insufficient
training instances for the learning process.

3.  Experimental  Results 

To evaluate the performance of the proposed probabilistic inductive learning
method and compare it with other classification methods, the following data
sets were used:


1. Dermatoglyphic Data - Over the decades, numerous articles have been
published reporting the clinical significance of finger prints and palm
patterns in congenital disease identification [4, 15, 17]. For an experiment
with the dermatoglyphic data, the finger-print patterns, the atd angle and the
ab ridge-count of both the right and left hands of 126 subjects were obtained
[17] (Fig. 1). These 126 subjects had been divided into three groups. The
first of these consisted of 51 myelomeningocele patients drawn from a spinal
dysfunction clinic. The second and third groups consisted of 40 and 35 normal
subjects, respectively, that show different dermatoglyphic patterns [17].

2. Medical Data I - For further performance evaluation, we used two sets of
medical data obtained from an Intensive Care Unit (ICU) [9]. The first set
consisted of 120 patient records, each of which was characterized by 12
attributes representing different symptoms derived from the monitoring of the
Central Nervous System, the Respiratory System, the Cardiovascular System, the
Skin Signs and the Renal System of the patient [9]. Based on the attributes,
the patients were classified into two groups by the physicians: those that had
to be placed under intensive care and those that could be discharged to the
main floor. The records of 66 patients were available for the first group and
54 for the second.

3. Medical Data II - The second set of medical data obtained from the same
source consists of 99 records characterized by the same 12 attributes [9].
Based on these attributes, each patient was diagnosed, by a group of
physicians, into one of four disease types [9]: (i) chest disease; (ii)
abdominal disease; (iii) cardiac disease; and (iv) neurological disease. In
this experiment, the records of 15 patients were available for Group 1, 33 for
Group 2, 25 for Group 3 and 26 for Group 4.

For performance evaluation, the classification accuracy of the following
systems on the above data sets was determined: (i) the ID3 method, which
builds a decision tree for classification based on the TDIDT (Top-Down
Induction of Decision Trees) algorithm described in [10]; (ii) the ID3 method
with pre-pruning, which terminates the branching process at a node when the
attribute associated with the node is tested (by the chi-square test) to be
statistically independent of the classes [11]; (iii) the ID3 method with
post-pruning, which constructs a set of rules equivalent to a decision tree
and then simplifies them by testing each term of each rule against the class
predicted by the conclusion of the rule for statistical dependence (using
Fisher's Exact Test) [12-13]; (iv) the Bayesian Classifier, which computes the
class conditional probability for every class by assuming, due to the small
sample size for training, that all the attributes are independent, and then
selects the class with the highest probability; (v) the Default or Simple
Majority Classifier, which simply assigns the most commonly occurring class in
the training set to all objects that are to be classified, with no reference
to their attributes; and (vi) the proposed method without rejection (i.e., an
object is assigned to the most commonly occurring class when there is not
enough evidence supporting its assignment to any specific one). In the
evaluation, ten different randomly selected training (70% of the total
available data) and testing samples were generated for each of the three sets
of data. The results, averaged over the ten randomly chosen training and
testing samples, are given in Table 1.
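
The evaluation protocol (ten random 70%/30% splits, with accuracy averaged over the runs) can be sketched as follows, using the Simple Majority Classifier as the learner because it is fully specified in the text. The function names, the `(attributes, label)` data layout, the fixed random seed, and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter


def majority_classifier(train):
    """Fit the Simple Majority Classifier: always predict the commonest class."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda attributes: label


def average_accuracy(data, fit, runs=10, train_fraction=0.7, seed=0):
    """Average test accuracy over several random training/testing splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train, test = shuffled[:cut], shuffled[cut:]
        predict = fit(train)
        correct = sum(predict(x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Toy data in which every object has the same label.
toy = [({"a": i}, "pos") for i in range(20)]
accuracy = average_accuracy(toy, majority_classifier)
```

Any of the other methods could be passed as `fit`, provided it takes a training sample and returns a prediction function.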

As expected, these results indicate that the classification accuracy of the
Simple Majority method, which served as a reference point for evaluating the
various methods, was the poorest. The Bayesian Classifier performed much
better than the Simple Majority method. However, it was, in general, not as
good as the decision-tree-based algorithms except in the experiment with the
second set of medical data.

Of the three ID3- or decision-tree-based methods, the use of a pre-pruning
strategy during the construction of decision trees did not seem to improve the
classification accuracy. In fact, it consistently made more errors than the
simple ID3 algorithm in which no pre-pruning was performed. This was due to
the use of the chi-square test in deciding whether branching at a node should
be terminated. In order for the assumptions behind the chi-square test to be
valid, a very large training set for each class of objects has to be
available. This is because, as a decision tree is being built, the set of
training samples is repeatedly subdivided in the hope that no single exception
remains. Therefore, the further down the tree, the smaller the resulting
subset of training samples. Since, according to [11], the expected frequency
of each cell in the contingency table constructed for a chi-square test should
be greater than four, the unavailability of a large training set for each
class often results in the branching process terminating too early.

Recently, it has been found in the statistics community that the restriction
for the expected frequency in each cell to be at least four is too
conservative. It is suggested that, at least for tests conducted at a
confidence level of 95%, the minimum expected cell frequencies can be
approximately 1.0 [6]. In fact, the test for the ID3 algorithm with
pre-pruning was performed under such an assumption. Unfortunately, it did not
improve the classification accuracy of the algorithm by much.

As for the ID3 method without pruning and that with post-pruning, they are
comparable in performance in the sense that the former is more accurate in
experiments with the first set of medical data but not as accurate in the
other experiments.

Of all the six methods, the percentage of correct classification of the
proposed method was the highest in all experiments. Direct comparisons between
the proposed method and other decision-tree pruning methods are not yet
available. However, since ID3 with post-pruning has been shown to have
superior performance when compared to the others [12], we compared the
proposed method only with ID3.

4.  Discussions  and  Summary 

In evaluating the performance of a learning method, it is often necessary to
consider its decision accuracy, the amount of training required to achieve a
specific level of performance, as well as how costly that training is. One
measure of training effort is the complexity of the method [3].

Let M denote the size of the training set and n the total number of attributes
that describe each of the training instances. The critical component of the
ID3 algorithm is the process of selecting a test attribute on which to branch.
Each such choice involves the following operations [3]:

1. For each attribute, example counts are put in an array, indexed by class
and attribute. This takes time O(n·M);

2. The entropy function is calculated for each attribute, taking time O(n);

3. Once the best attribute is found, the examples are divided into J different
sets (where J is the total number of values that the best attribute takes on).
This takes time O(M).

Therefore, the overall time for a single attribute choice at each node in a
decision tree is O(n·M). The time taken to construct the complete tree depends
very much on the structure of the tree. In general, it is O(n·M·(Total no. of
nodes in the tree)).
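
The cost of a single attribute choice can be seen in the following sketch, in which each example is visited once per attribute while the counts are tallied. It uses the usual entropy-based information gain as the selection criterion; the function names, the `(values, label)` data layout, and the four toy examples are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(class_counts):
    """Entropy (in bits) of a Counter of class frequencies."""
    total = sum(class_counts.values())
    return -sum(c / total * math.log2(c / total)
                for c in class_counts.values() if c)


def best_attribute(examples, n_attrs):
    """Return (attribute index, information gain) of the best test attribute.

    examples: list of (tuple of attribute values, class label).
    The nested loops visit every attribute of every example once,
    which is the O(n*M) counting step described in the text.
    """
    base = entropy(Counter(label for _, label in examples))
    best, best_gain = None, -1.0
    for j in range(n_attrs):
        counts = defaultdict(Counter)  # attribute value -> class counts
        for values, label in examples:
            counts[values[j]][label] += 1
        remainder = sum(sum(c.values()) / len(examples) * entropy(c)
                        for c in counts.values())
        gain = base - remainder
        if gain > best_gain:
            best, best_gain = j, gain
    return best, best_gain


# Toy data: attribute 0 separates the classes, attribute 1 does not.
toy = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "b"), ((1, 1), "b")]
chosen, gain = best_attribute(toy, n_attrs=2)
```

Building a full tree repeats this step at every node, which is where the dependence on the number of nodes comes from.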

As for the ID3 algorithm with pre-pruning, the need to test for statistical
dependence between each attribute and the classes at each node does not affect
the complexity of the critical component of the ID3 algorithm. In fact, since
a smaller tree with fewer nodes is usually obtained as a result of tree
pruning, the overall rate of learning may even be improved.

The ID3 algorithm with post-pruning requires a complete decision tree to be
built first. The complexity for this process is O(n·M·(Total no. of nodes in
the tree)). The second part of the post-pruning algorithm involves
constructing a set of rules equivalent to the decision tree. Each path in the
tree is turned into a rule, with the number of conditions on the left-hand
side of the rule equal to the path length. The conditions of each rule are
then examined in turn to determine, by Fisher's Exact Test for statistical
independence in two-by-two contingency tables, the ones that are relevant for
the classification process. Conditions that are irrelevant are discarded from
a rule. The time complexity of this process is, once again, very much
dependent on the structure of the tree and, therefore, the rules constructed
from it. In particular, it is dependent on the number of rules and the number
of conditions on their left-hand sides. The process takes time O(M·(Total no.
of rules)·(Average no. of conditions per rule)).

To further simplify the set of pruned rules, each of them is omitted in turn
so as to determine how well the rest of the rules perform on the training
instances. If there are rules whose omission would not lead to an increase in
the number of misclassified instances or would even reduce it, then the least
useful one is discarded and the process is repeated until no such rule can be
found. The time complexity of this process is dependent on the number of rules
that have to be deleted from the resulting rule set after the rule
simplification process. This process takes time O(M·(Total no. of rules)).

The time complexity of the proposed classification method, which generates the
contingency tables and searches them for significant adjusted residuals, is
O(n·M). Since this cost does not depend on the classification rules generated,
this learning method is substantially faster than the decision-tree-based
algorithms, whose efficiency depends very much on the structure of the
decision trees. In addition, unlike the decision-tree-based algorithms, in
which the basic operation is repeatedly applied, the basic operation of the
proposed method, i.e., the analysis of residuals, is performed only once.
Since the residuals can be examined independently of each other, the proposed
learning method has been employed in the implementation of an artificial
neural network, which has the advantage of being very fast in its training
process as well as being able to provide a facility for direct analysis of
internal associations. This network has been tested on simulated and
real-world data with performance superior to a Back Propagation Network [18].
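
The single pass over the training data that underlies this cost can be sketched as below: one contingency table per attribute is filled by visiting each attribute value of each training object exactly once. The function name `contingency_tables`, the use of `None` to mark a missing value, and the two toy objects are assumptions of this sketch.

```python
from collections import Counter


def contingency_tables(examples, n_attrs):
    """Build one class-by-value table of counts per attribute in one pass.

    examples: list of (tuple of attribute values, class label); a value
    of None marks a missing entry and is skipped, so the total of a
    table may be smaller than the number of training objects.
    """
    tables = [Counter() for _ in range(n_attrs)]
    for values, label in examples:          # M training objects
        for j in range(n_attrs):            # n attributes each
            if values[j] is not None:
                tables[j][(label, values[j])] += 1
    return tables


# Toy data: the second attribute is missing for the first object.
toy = [(("x", None), "a"),
       (("x", "y"), "b")]
tables = contingency_tables(toy, n_attrs=2)
```

The residuals of each table can then be examined separately, with no further passes over the training objects.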


In summary, we have presented an efficient probabilistic inductive learning
method for solving classification problems even when (i) the data contains
inaccurate, incomplete and inconsistent values; (ii) no assumption concerning
any specific mathematical model for the data can be made; (iii) the data is of
high dimensionality; and (iv) the training sample size is relatively small.
The proposed method has been implemented and tested using real-life data sets.
The test results showed that the proposed method consistently outperformed, in
terms of accuracy and computational efficiency, many commonly used methods
that were developed to handle uncertainty in classification tasks.

Table 1. Performance (% Correct) Comparison of Different Classification Methods

                                        Classification Methods
Data                 ID3 without  ID3 with     ID3 with      Bayesian    Simple    PIT-Based
                     pruning      pre-pruning  post-pruning  Classifier  Majority  APACS
-------------------  -----------  -----------  ------------  ----------  --------  ---------
Dermatoglyphic Data  51.0%        48.2%        55.0%         44.7%       37.5%     68.2%
Medical data I       89.2%        82.0%        85.0%         79.2%       51.1%     91.1%
Medical data II      79.5%        59.5%        87.0%         81.0%       35.0%     94.0%

Abstract

The performance of the error backpropagation (BP) and ID3 learning algorithms
was compared on the task of mapping English text to phonemes and stresses.
Under the distributed output code developed by Sejnowski and Rosenberg, it is
shown that BP consistently out-performs ID3 on this task by several percentage
points. Three hypotheses explaining this difference were explored: (a) ID3 is
overfitting the training data, (b) BP is able to share hidden units across
several output units and hence can learn the output units better, and (c) BP
captures statistical information that ID3 does not. We conclude that only
hypothesis (c) is correct. By augmenting ID3 with a simple statistical
learning procedure, the performance of BP can be approached but not matched.
More complex statistical procedures can improve the performance of both BP and
ID3 substantially. A study of the residual errors suggests that there is still
substantial room for improvement in learning methods for text-to-speech
mapping.

1  Introduction 

The  task  of  mapping  English  text  into  speech  is  quite 
difficult  (see  Klatt,  1987).  One  particularly  difficult 
step  involves  mapping  words  (i.e.,  strings  of  letters) 
into  strings  of  phonemes  and  stresses.  In  this  paper, 
we  compare  two  machine  learning  algorithms  applied 
to  the  task  of  learning  this  text-to-speech  mapping. 
We  employ  the  formulation  developed  by  Sejnowski 
and  Rosenberg  (1987)  in  their  widely  known  work  on 
NETTALK. 

Let L be the set of 29 symbols comprising the letters a-z, and the comma,
space, and period (in our data sets, comma and period do not appear). Let P be
the set of 54 English phonemes and S be the set of 6 stresses employed by
Sejnowski and Rosenberg. The task is to learn the mapping
$f : L^* \to P^* \times S^*$. Specifically, $f$ maps from a word of length $k$
to a string of phonemes of length $k$ and a string of stresses of length $k$.
For example,

f("lollypop") = ("lal-ipap", ">1<>0>2<").

Notice that letters, phonemes, and stresses have all been aligned so that
silent letters are mapped to the silent phoneme /-/.

As defined, $f$ is a very complex discrete mapping with a very large range. If
we assume no word contains more than 28 letters, the range would contain more
than $10^{70}$ elements. Existing learning algorithms focus primarily on
learning Boolean concepts, that is, functions whose range is the set {0,1}.
Such algorithms cannot be applied directly to learn $f$.

Fortunately, Sejnowski and Rosenberg developed a technique for converting this
complex learning problem into the task of learning a collection of Boolean
concepts. They begin by reformulating $f$ to be a mapping $g$ from a
seven-letter window to a single phoneme and a single stress. For example, the
word "lollypop" would be converted into 8 separate 7-letter windows:

g("___loll") = ("l", ">")
g("__lolly") = ("a", "1")
g("_lollyp") = ("l", "<")
g("lollypo") = ("-", ">")
g("ollypop") = ("i", "0")
g("llypop_") = ("p", ">")
g("lypop__") = ("a", "2")
g("ypop___") = ("p", "<")

The  function  g  is  applied  to  each  of  these  8  windows, 
and  then  the  results  are  concatenated  to  obtain  the 
phoneme  and  stress  strings.  This  mapping  function  g 
now  has  a  range  of  324  possible  phoneme/stress  pairs, 
which  is  a  substantial  improvement. 
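
The windowing step can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not Sejnowski and Rosenberg's code: the function name `seven_letter_windows` and the use of "_" for the blank symbol are choices made here.

```python
def seven_letter_windows(word: str, pad: str = "_", width: int = 7) -> list[str]:
    """Return one window per letter, centered on that letter.

    The word is padded with blanks so that the first and last letters
    can also sit in the middle of a full-width window.
    """
    half = width // 2
    padded = pad * half + word + pad * half
    return [padded[i:i + width] for i in range(len(word))]


windows = seven_letter_windows("lollypop")
for w in windows:
    print(w)

# Size of the range of g: 54 phonemes times 6 stresses.
n_pairs = 54 * 6
print(n_pairs)
```

Applying g to each window and concatenating the per-window phonemes and stresses rebuilds the two aligned output strings for the whole word.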

Finally,  Sejnowski  and  Rosenberg  code  each  possible 
phoneme/stress  pair  as  a  26-bit  string,  21  bits  for  the 
phoneme  and  5  bits  for  the  stress.  Each  bit  in  the  code 
corresponds  to  some  property  of  the  phoneme  or  stress. 
This converts g into 26 separate Boolean functions,
h_1, ..., h_26. Each function h_i maps from a seven-letter
window to the set {0,1}. To assign a phoneme and
stress to a window, all 26 functions are evaluated to
produce a 26-bit string. This string is then mapped
to the nearest of the 324 bit strings representing legal
phoneme/stress pairs. We used the Hamming distance
between two strings to measure distance. (Sejnowski
and Rosenberg used the angle between two strings to
measure distance, but they report that the Euclidean
distance metric gave similar results. In tests with the
Euclidean metric, we have obtained results identical to
those reported in this paper.)

With this reformulation, it is now possible to apply
Boolean concept learning methods to learn the h_i.
However, the individual h_i must be learned extremely
well in order to obtain good performance at the level
of entire words. This is because errors aggregate. For
example, if each h_i is learned so well that it is 99%
correct and if the errors among the h_i are independent,
then the 26-bit string will be correct only 77% of the
time. Because the average word has about 7 letters,
whole words will be correct only 16% of the time.
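
The aggregation arithmetic can be checked directly. The sketch below simply multiplies independent per-bit accuracies, which is the independence assumption stated in the text; the variable names are illustrative.

```python
bit_accuracy = 0.99        # each h_i is correct 99% of the time
n_bits = 26                # bits per phoneme/stress code
letters_per_word = 7       # approximate average word length

# All 26 bits must be right for the window to be right.
window_accuracy = bit_accuracy ** n_bits
# All windows of a word must be right for the word to be right.
word_accuracy = window_accuracy ** letters_per_word

print(f"26-bit string correct: {window_accuracy:.1%}")
print(f"whole word correct:    {word_accuracy:.1%}")
```

In practice the nearest-codeword decoding step can repair some bit errors, so this calculation describes the undecoded bit string rather than final letter-level accuracy.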

In the remainder of this paper, we describe a series
of experiments comparing the performance of the error
backpropagation algorithm (BP) to the decision-tree
learning algorithm ID3. We begin by comparing BP
and ID3 on the task described above. Having established
that BP significantly outperforms ID3 on this
task, we formulate three hypotheses to explain this
difference. We test these hypotheses by performing
additional experiments. These experiments demonstrate
that ID3, combined with some simple statistical learning
procedures, can nearly match the performance of
BP. Finally, we present data suggesting that there is
still substantial room for improvement of learning
algorithms for text-to-speech mapping.

2  A  Simple  Comparative  Study 

In this study, ID3 and BP were both applied to the
learning  task  described  above.  We  begin  by  briefly 
reviewing  these  two  learning  algorithms  and  the  data 
set. 

2.1  The  Algorithms 

ID3 is a simple decision-tree learning algorithm
developed by Ross Quinlan (1983; 1986b). The version
we employed used the information gain criterion to
choose which feature to place at the root of each
decision tree (and subtree). We did not employ windowing
(Quinlan, 1983), chi-square forward pruning (Quinlan,
1986a), or any kind of reverse pruning (Quinlan,
1987). We did apply one simple kind of forward pruning
to handle inconsistencies in the training data: If
all training examples agreed on the value of the chosen
feature, then growth of the tree was terminated in a
leaf and the class having more training examples was
chosen as the label for that leaf (in case of a tie, the
leaf was assigned to class 0).

To apply ID3 to this task, the algorithm must be
executed 26 times, once for each mapping h_i. Each
of these executions produces a separate decision tree.
The  seven-letter  window  was  represented  as  the  con¬ 
catenation  of  seven  29-bit  strings.  Each  29-bit  string 
represents  a  letter  (one  bit  for  each  letter,  period, 
comma,  and  blank),  and  hence,  only  one  bit  is  set  to 
1  in  each  29-bit  string.  This  produces  a  string  of  203 
bits  for  each  window. 
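
The input encoding can be sketched as follows. The ordering of the 29 symbols inside each group and the use of "_" for the blank are assumptions made for illustration; the source states only that each letter position gets one bit per symbol.

```python
# 26 letters plus comma, blank (written "_" here) and period: 29 symbols.
ALPHABET = "abcdefghijklmnopqrstuvwxyz,_."


def encode_window(window: str) -> list[int]:
    """Concatenate one 29-bit one-hot group per letter of the window."""
    bits: list[int] = []
    for ch in window:
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(ch)] = 1   # exactly one bit set per position
        bits.extend(one_hot)
    return bits


encoded = encode_window("_lollyp")
print(len(encoded), sum(encoded))
```

A seven-letter window therefore yields 7 x 29 = 203 input bits, of which exactly seven are set.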


The  error  backpropagation  algorithm  (Rumelhart, 
Hinton  &  Williams,  1986)  is  widely  applied  to  train 
artificial  neural  networks.  We  replicated  the  network 
architecture  and  training  procedure  employed  by  Se¬ 
jnowski  and  Rosenberg  (1987).  This  network  is  a  fully- 
connected  feed-forward  network  containing  203  input 
units,  120  hidden  units,  and  26  output  units  (one  for 
each  mapping  hi).  We  employed  the  same  input  and 
output  encodings  described  above. 

Unlike ID3, BP needs to be applied only once,
because all 26 output bits can be learned simultaneously.
Indeed, the 26 outputs all share the collection of
120 hidden units, which may allow them to be learned
more accurately. However, while ID3 is a batch algorithm
that processes the entire training set at once,
BP is an incremental algorithm that makes repeated
passes over the data. Each complete pass is called an
“epoch.” During an epoch, the training examples are
inspected one at a time, and the weights of the network
are adjusted to reduce the squared error of the
outputs. We used a learning rate of .25 and a momentum
term of .9. The weights of the network were
initialized to random values between -.3 and +.3. In
all cases, we trained for 30 epochs, since this was the
training regime followed by Sejnowski and Rosenberg.
We used the implementation provided with (McClelland
and Rumelhart, 1988).

Because the outputs from BP are floating point
numbers between 0 and 1, we had to adapt the Hamming
distance measure when mapping to the nearest
legal phoneme/stress pair. We used the following distance
measure: d(x, y) = Σ_i |x_i - y_i|. This reduces
to the Hamming distance when x and y are Boolean
vectors.
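
A minimal sketch of decoding to the nearest legal code with this distance is given below. The 4-bit codebook is a toy stand-in invented for illustration (the real codes are 26 bits long and there are 324 of them), and the function names are hypothetical.

```python
def l1_distance(x, y) -> float:
    """Sum of absolute differences; equals Hamming distance on 0/1 vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def decode_nearest(output, codebook):
    """Return the (phoneme, stress) pair whose code is closest to `output`."""
    return min(codebook, key=lambda pair: l1_distance(output, codebook[pair]))


# Hypothetical toy codebook: (phoneme, stress) -> bit code.
toy_codebook = {
    ("p", ">"): (1, 0, 1, 0),
    ("a", "1"): (0, 1, 1, 0),
    ("-", "<"): (0, 0, 0, 1),
}

network_output = (0.9, 0.2, 0.7, 0.1)   # real-valued outputs, as from BP
decoded = decode_nearest(network_output, toy_codebook)
print(decoded)
```

The same function serves ID3, whose outputs are already 0/1 vectors, because the distance then counts differing bits.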

2.2  The  Data  Set 

Sejnowski  and  Rosenberg  provided  us  with  a  dictio¬ 
nary  of  20,003  words  and  their  corresponding  phoneme 
and  stress  strings.  This  dictionary  was  randomly  par¬ 
titioned  into  a  testing  set  of  1000  words,  and  a  training 
set  of  19,003  words.  This  training  set  was  further  sub¬ 
divided  to  extract  smaller  training  sets  of  1000,  800, 
400,  200,  100,  and  50  words.  Each  smaller  training  set 
was  extracted  by  randomly  sampling  from  the  next 
larger  set. 

2.3  Results 

Table 1 shows percent correct (over the 1000-word test
set) as a function of the size of the training set for
words, letters, phonemes, and stresses. Virtually every
difference in the table at the word, letter, phoneme,
and stress levels is statistically significant (using the
test for the difference of two proportions). Hence, we
conclude that there is a substantial difference in
performance between ID3 and BP on this task.

To take a closer look at the performance difference,
we can study exactly how each of the 7,242 windows
in the test set is handled by each of the algorithms.
Table 2 categorizes each of these windows according to
whether it was correctly classified by both algorithms,
by only one of the algorithms, or by neither one.

Table 1: Percent correct over 1000-word test set

Sample            Level of Aggregation (% correct)
Size    Method    Word      Letter    Phoneme   Stress    Bit (mean)
  50    ID3        0.8      41.5      60.5      60.1      93.1
        BP         1.8*     48.4***   59.4      72.9***   93.5
 100    ID3        2.0      47.3      64.1      65.8      94.0
        BP         3.7*     55.2**    66.1**    75.5**    94.4
 200    ID3        4.4      56.6      70.5      72.2      95.1
        BP         6.0      61.4***   71.9*     78.6***   95.3
 400    ID3        6.2      58.7      73.7      72.1      95.5
        BP        10.5***   65.7***   76.0***   79.9***   95.9
 800    ID3        9.6      63.8      77.8      75.6      96.2
        BP        12.2*     68.7***   78.9      80.7***   96.3
1000    ID3        9.6      65.6      78.7      77.2      96.4
        BP        14.7***   70.9***   81.1***   81.4***   96.6

Difference in the cell significant at p < .05 (*), .01 (**), .001 (***)


Table 2: Classification of test set windows by ID3 and
Backpropagation, decoding to nearest legal phoneme
and stress.

                      Back Propagation
                    Correct   Incorrect   Total
ID3   Correct         4231        520      4751
      Incorrect        907       1584      2491
      Total           5138       2104      7242

The table shows that the windows correctly learned
by BP do not form a superset of those learned by
ID3. Instead, the two algorithms share 4,231 correct
windows, and then each algorithm correctly classifies
several windows that the other algorithm gets
wrong. The net result is that BP classifies 387 more
windows correctly than does ID3. This shows that
the two algorithms, while they share substantial overlap,
have learned substantially different text-to-speech
mappings.

The information in this table can be summarized as
a correlation coefficient. Specifically, let X_ID3 (X_BP)
be a random variable that is 1 iff ID3 (BP, respectively)
makes a correct prediction at the letter level. In this
case, the correlation between X_ID3 and X_BP is .5508.
If all four cells of Table 2 were equal, the correlation
coefficient would be zero.
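
For two 0/1 variables the Pearson correlation can be computed from the four cell counts alone (the phi coefficient). The sketch below applies that standard formula to the counts in Table 2; the function and argument names are illustrative.

```python
from math import sqrt


def phi_coefficient(both: int, first_only: int, second_only: int, neither: int) -> float:
    """Pearson correlation of two binary variables from a 2x2 table."""
    row_correct = both + first_only          # first method correct
    row_wrong = second_only + neither        # first method incorrect
    col_correct = both + second_only         # second method correct
    col_wrong = first_only + neither         # second method incorrect
    numerator = both * neither - first_only * second_only
    return numerator / sqrt(row_correct * row_wrong * col_correct * col_wrong)


# Cell counts from Table 2 (ID3 is the first method, BP the second).
corr = phi_coefficient(both=4231, first_only=520, second_only=907, neither=1584)
print(round(corr, 4))
```

With four equal cells the numerator vanishes, which is why the coefficient is zero in that case.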

A weakness of Table 1 is that it shows performance
values  for  one  particular  choice  of  training  and  test 
sets.  We  have  replicated  this  study  four  times  (for 
a  total  of  5  independent  trials).  Table  3  shows  the 
average  performance  of  these  5  runs  (each,  of  course, 
on  a  different  randomly-drawn  1000-word  test  set).  All 
differences  are  significant  below  the  .0001  level  using 
a  t-test  for  paired  differences. 

In  the  remainder  of  this  paper,  we  will  attempt  to 
understand  the  nature  of  the  differences  between  BP 
and ID3. Our main approach will be to experiment


with  modifications  to  the  two  algorithms  that  enhance 
or  eliminate  the  differences  between  them.  All  of  these 
experiments  are  performed  using  only  the  training  set 
and  test  set  from  Table  1. 


3  Three  Hypotheses 

What causes the differences between ID3 and BP? We
have three hypotheses:

Hypothesis 1: Overfitting. ID3 has overfit the
training data, because it seeks complete consistency.
This causes it to make more errors on the test set.

Hypothesis 2: Sharing. The ability of BP to share
hidden units among all of the h_i may allow it to reduce
the aggregation problem at the bit level.

Hypothesis 3: Statistics. The numerical parameters
in the BP network allow it to capture statistical
information that is not captured by ID3.

These hypotheses are neither exclusive nor exhaustive.

The following three subsections present the experiments
that we performed to test these hypotheses.


3.1  Tests  of  Hypothesis  1  (Overfitting) 


The tendency of ID3 to overfit the training data is
well  established  in  cases  where  the  data  contain  noise. 
Three  basic  strategies  have  been  developed  for  address¬ 
ing  this  problem:  (a)  criteria  for  early  termination 
of  the  tree-growing  process,  (b)  techniques  for  prun¬ 
ing  trees  to  remove  overfitting  branches,  and  (c)  tech¬ 
niques  for  converting  the  decision  tree  to  a  collection 
of  rules.  We  implemented  and  tested  one  method  for 
each  of  these  strategies.  Table  4  summarizes  the  re¬ 
sults. 


Table 3: Average percent correct (1000-word test set) over five trials.

Sample size 1000; methods ID3 and BP; levels of aggregation (% correct).
[The numeric entries of this table are illegible in the source.]

Table 4: Results of applying three overfitting-prevention techniques.

                                            Level of Aggregation (% correct)
Method                           Data set   Word   Letter   Phoneme   Stress   Bit (mean)
(a) ID3 (as above)               TEST       9.6    65.6     78.7      77.2     96.4
(b) ID3 (chi-square cutoff)      TEST       9.1    64.8     78.4      77.1     96.4
(c) ID3 (reduced-error pruning)  TEST       9.3    62.4     76.9      75.1     96.1
(d) ID3 (rules)                  TEST       8.2    65.1     78.5      77.2     96.4

The first row repeats the basic ID3 results given
above, for comparison purposes. The second row
shows the effect of applying a chi-square test (at the .90
confidence level) to decide whether further growth
of the decision tree is statistically justified (Quinlan,
1986a). As other authors have reported (Mooney et
al., 1989), this hurts performance in the NETTALK
domain. The third row shows the effect of applying
Quinlan’s technique of reduced-error pruning (Quinlan,
1987). Mingers (1989) provides evidence that this is
one of the best pruning techniques. For this row, a
decision tree was built using the 800-word training set,
and then pruned using the additional words from the
1000-word training set that do not appear in the
800-word training set. Other trials using words from a
1600-word training set produced similar results. Finally,
the fourth row shows the effect of applying Quinlan’s
method for converting a decision tree to a collection
of rules. Quinlan’s method has three steps, of
which we performed only the first two. First, each
path from the root to a leaf is converted into a
conjunctive rule. Second, each rule is evaluated to remove
unnecessary conditions. Third, the rules are combined,
and unnecessary rules are eliminated. The third step
was too expensive to perform on this rule set, which
contains 6,988 rules.

None of these techniques improved the performance
of ID3 on this task. This suggests that Hypothesis
1 is incorrect: ID3 is not overfitting the data in this
domain. This makes sense, since the only source of
“noise” in this domain is the limited size of the 7-letter
window and the existence of a small number of words
like “read” that have more than one correct pronunciation.
Seven-letter windows are sufficient to correctly
classify 98.5% of the words in the 20,003-word dictionary.

3.2  Tests  of  Hypothesis  2  (Sharing)

To test the sharing hypothesis, we attempted to train
26 independent networks, each having only one output
unit, to learn the h_i mappings. If Hypothesis 2
is correct, then, because there is no sharing among
these separate networks, we should see a drop in
performance compared to the single network with shared
hidden units. Furthermore, the decrease in performance
should decrease the differences between BP and
ID3.


Surprisingly,  we  were  unable  to  train  successfully 
the  separate  networks  to  the  target  error  level  on  any 
training  set  other  than  the  50-word  set.  For  the  100- 
word  training  set,  for  example,  the  individual  networks 
often  converged  to  local  minima  (even  though  the  120- 
hidden-unit  network  had  avoided  these  minima).  This 
shows  that  even  if  shared  hidden  units  do  not  aid  clas¬ 
sification  performance,  they  certainly  aid  the  learning 
process! 

As  a  consequence  of  this  training  problem,  we  are 
able  to  report  results  for  only  the  50-word  training 
set.  Table  5  shows  the  performance  of  these  26  net¬ 
works  on  the  training  and  test  sets.  Performance  on 
the  training  set  is  virtually  identical  to  the  120-hidden- 
unit  network,  which  shows  that  our  training  regime 
was  successful.  Performance  on  the  test  set,  however, 
shows  a  loss  of  performance  when  there  is  no  sharing 
of  the  hidden  units  among  the  output  units.  Hence,  it 
suggests  that  Hypothesis  2  is  at  least  partially  correct. 
However, examination of the correlation between ID3
and BP indicates that this is wrong. The correlation
between X_ID3 and X_BP1 (i.e., BP on the single network)
is .5167, whereas the correlation between X_ID3
and X_BP26 is .4942. Hence, the removal of shared
hidden units has actually made ID3 and BP less similar,
rather than more similar as Hypothesis 2 suggests.
The conclusion is that sharing in backpropagation is
important to improving its performance, but it does
not explain why ID3 and BP are performing differently.

3.3  Tests  of  Hypothesis  3:  Statistics 

We  performed  three  experiments  to  test  the  third  hy¬ 
pothesis. 

In  the  first  experiment,  we  took  the  outputs  of  the 
back-propagation  network  and  thresholded  them  (val¬ 
ues  >  .5  were  mapped  to  1,  values  <  .5  were  mapped  to 
0)  before  mapping  to  the  nearest  legal  phoneme/stress 
pair.  Table  6  presents  the  results  for  the  1000-word 
training  set. 

Table 5: Performance of 26 separate networks compared with a single network having 120 shared hidden units.
Trained on 50-word training set, tested on 1000-word test set.

                                      Level of Aggregation (% correct)
Method                     Data set   Word    Letter    Phoneme   Stress    Bit (mean)
(a) ID3                    TEST:       0.8    41.5      60.5      60.1      92.6
(b) BP 26 separate nets    TRAIN:     82.0    97.6      98.4      99.2      99.9
                           TEST:       1.6    45.0      56.6      71.1      92.9
(c) BP 120 hidden units    TRAIN:     82.0    97.4      98.2      99.2      99.9
                           TEST:       1.8    48.4      59.4      72.9      93.5
Difference (b)-(c)         TRAIN:      0.0    +0.2      +0.2       0.0       0.0
                           TEST:      -0.2    -3.4***   -2.8**    -1.8*     -0.6
Difference (a)-(c)         TEST:      -1.0    -6.9      +1.1     -12.8      -0.9

Table 6: Performance of backpropagation with thresholded output values. Trained on 1000-word training set.
Tested on 1000-word test set.

                                      Level of Aggregation (% correct)
Method                     Data set   Word    Letter    Phoneme   Stress    Bit (mean)
(a) ID3 (legal)            TEST:       9.6    65.6      78.7      77.2      96.1
(b) BP (legal)             TEST:      14.7    70.9      81.1      81.4      96.6
(c) BP (thresholded)       TEST:      12.1    67.9      78.6      --        96.6
Difference (c)-(b)         TEST:      --      --        -2.5***   --        --

(-- marks an entry that is illegible in the source.)

The results show that thresholding significantly
drops the performance of back-propagation. Indeed,
at the phoneme level, the decrease is enough to push
BP below ID3. However, at the other levels of
aggregation, BP still out-performs ID3. Nevertheless,
the results support the hypothesis that the continuous
outputs of the neural network aid the performance of
BP. A comparison of correlation coefficients confirms
this. The correlation between X_ID3 and X_BPthresh is
.5598 (as compared with .5508 for X_BP).

While  this  experiment  demonstrates  the  importance 
of  continuous  outputs,  it  does  not  tell  us  what  kind 
of  information  is  being  captured  by  these  continuous 
outputs  nor  does  it  reveal  anything  about  the  role  of 
continuous  weights  inside  the  network.  For  this,  we 
must  turn  to  the  other  two  experiments. 

In  the  second  experiment,  we  modified  the  method 
used  to  map  a  computed  26-bit  string  into  one 
of  the  324  strings  representing  legal  phoneme/stress 
pairs.  Instead  of  considering  all  possible  legal 
phoneme/stress  pairs,  we  restricted  attention  to  those 
phoneme/stress  pairs  that  had  been  observed  in  the 
training  data.  Specifically,  we  constructed  a  list  of  ev¬ 
ery  phoneme/stress  pair  that  appears  in  the  training 
set  (along  with  its  frequency  of  occurrence).  During 
testing,  the  26-bit  vector  produced  either  by  ID3  or 
BP  is  mapped  to  the  closest  phoneme/stress  pair  ap¬ 
pearing  in  this  list.  Ties  are  broken  in  favor  of  the 
most  frequent  phoneme/stress  pair.  We  call  this  the 
“observed”  decoding  method,  because  it  is  sensitive  to 
the  phoneme/stress  pairs  (and  frequencies)  observed 
in  the  training  set. 
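
The “observed” decoding rule can be sketched as below. The codebook, the counts, and the function names are toy assumptions for illustration; only the rule itself (restrict to pairs seen in training, break ties by frequency) comes from the text.

```python
from collections import Counter


def l1_distance(x, y) -> float:
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def observed_decode(output, codebook, observed_counts):
    """Pick the closest pair among those seen in training.

    Ties in distance are broken in favor of the more frequent pair.
    """
    return min(
        observed_counts,
        key=lambda pair: (l1_distance(output, codebook[pair]), -observed_counts[pair]),
    )


# Hypothetical toy codebook; pair C is legal but never seen in training.
A, B, C = ("a", "1"), ("p", ">"), ("x", "0")
toy_codebook = {A: (1, 1, 0, 0), B: (0, 0, 1, 1), C: (1, 0, 0, 0)}
training_counts = Counter({A: 5, B: 2})

print(observed_decode((1, 0, 0, 0), toy_codebook, training_counts))  # C is excluded
print(observed_decode((1, 0, 1, 0), toy_codebook, training_counts))  # tie between A and B
```

In the first call the exact match C is ignored because it never occurred in training; in the second the two observed pairs are equally distant and the more frequent one wins.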

Table  7  presents  the  results  for  the  1000-word  train¬ 
ing  set  and  compares  them  to  the  previous  tech¬ 
nique  (“legal”)  that  decoded  to  the  nearest  legal 
phoneme/stress  pair.  The  key  point  to  notice  is  that 


this  decoding  method  leaves  the  performance  of  BP 
virtually  unchanged  while  it  substantially  improves  the 
performance  of  ID3.  Indeed,  it  eliminates  a  substantial 
part  of  the  difference  between  ID3  and  BP.  Mooney  et 
al.  (1989),  in  their  comparative  study  of  ID3  and  BP 
on  this  same  task,  employed  a  version  of  this  decod¬ 
ing  technique  (without  the  tie-breaking  by  frequency), 
and  obtained  very  similar  results  when  training  on  a 
set  of  the  808  words  in  the  dictionary  that  occur  most 
frequently  in  English  text. 

An examination of the correlation coefficients
shows that “observed” decoding increases the similarity
between ID3 and BP. The correlation between
X_ID3observed and X_BPobserved is .5705 (as compared
with .5508 for “legal” decoding). Furthermore, “observed”
decoding is almost always monotonically better
(i.e., windows incorrectly classified by “legal” decoding
become correctly classified by “observed” decoding,
but not vice versa).

From  these  results,  we  can  conclude  that  BP  was 
already  capturing  most  of  the  information  about  the 
frequency  of  occurrence  of  phoneme/stress  pairs,  but 
that  ID3  was  not  capturing  nearly  as  much.  Hence, 
this  experiment  strongly  supports  Hypothesis  3. 

Table 7: Effect of “observed” decoding on learning performance.

                                        Level of Aggregation (% correct)
Method                       Data set   Word      Letter    Phoneme   Stress    Bit (mean)
(a) ID3 (legal)              TEST:       9.6      65.6      78.7      77.2      96.1
(b) BP (legal)               TEST:      14.7***   70.9***   81.1***   81.4***   96.6
(c) ID3 (observed)           TEST:      13.0      70.1      81.5      79.2      96.4
(d) BP (observed)            TEST:      14.9***   71.6*     81.8      81.4***   96.7
ID3 Improvement: (c)-(a)     TEST:       3.4***    4.5***    2.8***    2.0**     0.3
BP Improvement: (d)-(b)      TEST:       0.2       0.9       0.7       0.0       0.1

The final experiment concerning Hypothesis 3 focused
on extracting additional statistical information
from the training set. We were motivated by Klatt’s
(1987) view that ultimately letter-to-phoneme rules
will need to identify and exploit morphemes (i.e.,
commonly-occurring letter sequences appearing within
words). Therefore, we analyzed the training data to
find all letter sequences of length 1, 2, 3, 4, and 5, and
retained the 200 most-frequently-occurring sequences
of each length. For each retained letter sequence, we
formed a list of all phoneme/stress strings to which
that sequence is mapped in the training set (and their
frequencies). For example, here are the five pronunciations
of the letter sequence “ATION” in the training
set (format is ((phoneme string) (stress string)
(frequency))):

(("eS-xn" "1>0<<" 22)
 ("CS-xn" "1<0<<" 1)
 ("sS-xn" "2>0<<" 1)
 ("CS-xn" "2<0>>" 1)
 ("CS-xn" "1<0>>" 1))

During decoding, each word is scanned (from left to
right) to see if it contains one of the “top 200” letter
sequences of length k (varying k from 5 down to 1).
If a word contains such a sequence, it is mapped and
decoded as follows. First, each of the k windows in the
sequence is evaluated and the results concatenated to
obtain a bit string of length k·26. Then, this bit string
is mapped to the nearest of the bit strings observed
for this sequence in the training set (ties are broken
in favor of the more-frequently occurring bit string).
After decoding a block, control skips to the end of the
matched k-letter sequence and resumes scanning for
another “top 200” letter sequence of length k. After
this scan is complete, the parts of the word that have
not yet been matched are re-scanned to look for blocks
of length k - 1. We call this technique “block” decoding.
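
The scanning order described above can be sketched as follows. This covers only the segmentation of a word into blocks, not the decoding of each block, and the data layout (`top_sequences` as a dict from length to a set of strings) is an assumption made for this example.

```python
def find_blocks(word: str, top_sequences: dict[int, set[str]], max_len: int = 5):
    """Return (start, length) blocks, longest sequences matched first.

    For each k from max_len down to 1 the word is scanned left to right;
    a match claims its letters, and the scan skips past the matched block.
    Letters claimed at a larger k are not available at smaller k.
    """
    covered = [False] * len(word)
    blocks = []
    for k in range(max_len, 0, -1):
        candidates = top_sequences.get(k, set())
        i = 0
        while i + k <= len(word):
            is_free = not any(covered[i:i + k])
            if is_free and word[i:i + k] in candidates:
                blocks.append((i, k))
                covered[i:i + k] = [True] * k
                i += k                      # skip to the end of the block
            else:
                i += 1
    return sorted(blocks)


# Hypothetical "top 200" lists, reduced to two entries for illustration.
toy_top = {5: {"ation"}, 1: {"n"}}
blocks = find_blocks("nation", toy_top)
print(blocks)
```

Each block of length k would then be decoded by concatenating its k 26-bit outputs and choosing the nearest bit string observed for that letter sequence in training, with ties going to the more frequent one.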

Table 8 shows the performance results on the 1000-word
test set. Block decoding significantly improves
both ID3 and BP, but again, ID3 is improved much
more (especially below the word level). Furthermore,
the correlation coefficient between X_ID3block
and X_BPblock is .6747, which is a substantial increase
compared to .5508 for legal decoding. Hence, block
decoding also makes the performance of ID3 and BP
much more similar.

Curiously, these summary numbers hide substantial
shifts in performance caused by block decoding. To
demonstrate this, consider that there is only a .7153
correlation between X_ID3legal and X_ID3block. This
reflects the fact that while “block” decoding gains 736
windows previously misclassified by “legal” decoding,
it also loses 181 windows that were previously correctly
classified by “legal” decoding. Similarly, there is only a
.7746 correlation between X_BPlegal and X_BPblock
(reflecting a gain of 433 and a loss of 226 windows).

The conclusion we draw is that block decoding further
reduces the differences between ID3 and BP, and
hence that this experiment also supports Hypothesis
3. The experiment suggests that the block decoding
technique is a useful adjunct to any learning algorithm
applied in this domain. It also suggests that the
performance of block decoding could be improved if some
way could be found to avoid losing windows that were
correctly classified without block decoding. One technique
we are exploring is to combine the constraints of
blocks that overlap.

4  Discussion 

The  results  shown  in  previous  sections  demonstrate 
that ID3 and BP, while they attain similar levels of
performance,  still  do  not  cover  the  same  set  of  testing 
examples.  In  particular,  an  analysis  of  the  7,242  7- 
letter  windows  in  the  test  set  reveals  that  there  are  917 
windows  that  are  incorrectly  classified  by  one  of  the 
algorithms  and  correctly  classified  by  the  other.  This 
suggests  that  an  inductive  learning  algorithm  should 
be  able  to  label  correctly  all  of  these  917  windows. 
This  would  yield  a  performance  of  79.9%  at  the  letter 
level,  which  would  be  quite  good. 

Other directions for improving these algorithms include
(a) design of better error-correcting codes, (b)
block decoding using overlapping blocks, and (c) direct
analysis of the training set to identify morphemes.

Klatt  (1987)  points  out  three  properties  of  the  do¬ 
main  that  present  special  challenges  to  inductive  learn¬ 
ing  methods: 

(1)  the  considerable  extent  of  letter  con¬ 
text  that  can  influence  stress  patterns  in  a 
long  word  (and  hence  affect  vowel  quality  in 
words  like  “photograph/photography”),  (2) 
the  confusion  caused  by  some  letter  pairs, 
like  CH,  which  function  as  a  single  letter  in 
a  deep  sense,  and  thus  misalign  any  relevant 
letters  occurring  further  from  the  vowel,  and 
(3)  the  difficulty  of  dealing  with  compound 
words  (such  as  “houseboat”  with  its  silent 
“e”),  i.e.,  compounds  act  as  if  a  space  were 
hidden  between  two  of  the  letters  inside  the 
word. 

Table 8: Effect of “block” decoding on learning performance.

                                  Level of Aggregation (% correct)
Method               Data set   Word      Letter    Phoneme   Stress    Bit (mean)
(a) ID3 (legal)      TEST:       9.6      65.6      78.7      77.2      96.1
(b) BP (legal)       TEST:      14.7***   70.9***   81.1***   81.4***   96.6
(c) ID3 (block)      TEST:      17.5      73.2      83.8      --        96.7
(d) BP (block)       TEST:      19.9***   73.8      83.9      81.5*     96.7

(-- marks an entry that is illegible in the source; the improvement
rows of this table are also illegible.)

Long-distance interactions pose a difficult problem
for BP, since capturing them presumably requires a
very wide window. This in turn requires a very large
network with many weights, and these will be much
more difficult and time-consuming to train. ID3 scales
very well as the number of irrelevant features grows,
so it should be able to handle much wider windows
without problems.

General solutions to the other two problems mentioned
by Klatt appear to be quite challenging. Indeed,
machine learning techniques have some distance
to go before they match the performance of the best
human-engineered rule sets. Klatt cites Bernstein and
Pisoni (1980) as measuring the performance of their
rule system to be 97% at the phoneme level. By comparison,
ID3 trained on 19,003 words and tested using
“block” decoding is 91.5% correct at the phoneme level
(i.e., an error rate nearly three times as bad).

5  Conclusions 

The relative performance of ID3 and backpropagation
on the text-to-speech task depends on the decoding
technique employed to convert the 26 bits of the
Sejnowski/Rosenberg code into phoneme/stress pairs.
Decoding to the nearest legal phoneme/stress pair (the
most obvious approach) reveals a substantial difference
in the performance of the two algorithms.

Experiments  investigated  three  hypotheses  concern¬ 
ing  the  cause  of  this  performance  difference. 

The first hypothesis, that ID3 was overfitting the
training data, was shown to be incorrect. Three techniques
that avoid overfitting were applied, and none of
them improved ID3’s performance.

The second hypothesis, that the ability of backpropagation
to share hidden units was a factor, was
shown to be only partially correct. Sharing of hidden
units does improve the classification performance
of backpropagation and, perhaps more importantly,
the learning performance of the gradient descent
search. However, an analysis of the kinds of errors
made by ID3 and backpropagation (with or without
shared hidden units) demonstrated that these were
different kinds of errors. Hence, eliminating shared hidden
units does not produce an algorithm that behaves
like ID3.

The third hypothesis, that backpropagation was
capturing statistical information by some mechanism
(perhaps the continuous output activations), was
demonstrated to be the primary difference between
ID3 and BP. By adding the “block” decoding
technique to ID3, the level of performance of the
two algorithms in classifying test cases becomes
statistically indistinguishable (at the letter and phoneme
levels). Consequently, in tasks similar to the
text-to-speech learning task, ID3 with block decoding is
clearly the algorithm of choice, particularly for initial
exploratory studies, where its speed is a tremendous
advantage.

Abstract

This paper presents an approach to concept learning
from inconsistent data that foregoes a solution
to the full-blown problem and instead considers a
subcase, called bounded inconsistency. Data are
said to have bounded inconsistency when some
small perturbation to the description of any bad
instance will result in a good instance. The key
idea to learning in the presence of bounded
inconsistency is to consider not only concept
definitions that correctly classify all the training data,
but also those that miss some of the data by only a
small amount. The approach is implemented using
a generalization of Mitchell’s version-space
approach to concept learning.

1  Introduction 

The problem of inductive concept learning, that is, forming
general rules from specific cases, has been well studied in
machine learning and artificial intelligence. In the simplified
case that is often studied, the concept learning problem is
to find some general description in a concept description
language that covers all given positive examples of an
unknown concept, and includes no negative examples of the
concept. However, in real-world applications data are
often subject to error, and there may be no concept that
correctly classifies all the data. When data are inconsistent,
the learner will be unable to find a description classifying
all instances correctly. In general, learning systems must
generate reasonable results even when there is no concept
consistent with all the data.

Much of the past work on learning from inconsistent
data forms concept definitions that perform well but not
perfectly on the data, viewing those instances not
covered as anomalous (e.g., [Michalski and Larson, 1978;
Quinlan, 1986]). Instances are effectively thrown away,
even if they are merely slightly errant. The approach taken
here to learning from inconsistent data is to forego a
solution to the full problem, and instead solve a subcase of the
problem for one particular class of inconsistency that can
be exploited in learning. The underlying assumption for
this class of inconsistency, called bounded inconsistency, is
that some small perturbation to the description of any bad
instance will result in a good instance. When this is true,
a learning system can search through the space of concept
definitions that correctly classify either the original data, or
small perturbations of the data. The definition that does best
can be taken as the result of learning.

This is the approach taken here, and it is implemented
using a generalization of Mitchell’s [1978] version-space
approach to concept learning. Mitchell defines a version space
to be the set of all concept definitions in a prespecified
language that correctly classify training data, namely the
positive and negative examples of the unknown concept. The
generalized approach [Hirsh, 1989b; 1990], called incremental
version-space merging, removes the assumption that there
is always some concept definition that correctly classifies
all the given data.

The paper begins with a description of bounded
inconsistency, the form of inconsistency considered by this
paper. The paper continues with the general solution to this
problem, followed by its implementation with incremental
version-space merging. Experimental results are then
presented, followed by an overview of related work and a
general discussion. A formal analysis of how the quality of
results is influenced by the amount of data used in
learning concludes the paper. Further details are presented
elsewhere [Hirsh, 1989b].

2  Bounded  Inconsistency 

This paper addresses the problem of learning from
inconsistent data by solving a subcase of the problem called
bounded inconsistency. The underlying assumption for this
class of inconsistency is that some small perturbation to the
description of any bad instance will result in a good
instance. Whenever an instance is misclassified with respect
to the desired final concept definition, some nearby instance
description has the original instance’s classification.

Figure 1 shows a simple way to view this. Concepts
(such as C) divide the set of instances (I) into positives and
negatives. $I_1^+$ is a representative positive example.
It is correctly classified with respect to the desired
concept C. Similarly, $I_2^-$ is a correctly classified
representative negative example. $I_3^+$, however, is incorrectly
classified as positive even though the desired concept would label
it negative. However, a neighboring instance description,
$I_3^{\prime-}$, is near it across the border, and is correctly
classified. The same holds for the incorrectly classified negative
instance $I_4^-$ and its neighbor $I_4^{\prime+}$. Roughly
speaking, if misclassifications only occur near the desired
concept’s boundary, the data have bounded inconsistency.


Figure 1: Pictorial Representation of Bounded Inconsistency.

For example, if instances are described by conjunctions
of features whose values are determined by measuring
devices in the real world, and the measuring devices are
known to be accurate only within some tolerance, bounded
inconsistency can occur. Consider an instance that is
classified as positive, with a feature whose value is 5.0 (obtained
by the real-world measuring device). If the tolerance on the
measurement of the feature’s value is 0.1, the instance could
really have been one with feature value 4.9. If the “true”
positive instance were 4.9, and the instance that really has
value 5.0 would have been negative, a misclassification
error has occurred. If the tolerance information is correct, for
every incorrect instance there is a neighboring correct
instance description, all of whose feature values are no more
than the tolerance away from the original instance’s value
for that feature. This is an example of bounded
inconsistency.
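The measurement example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: feature values are held as integer "ticks" of 0.1 so that the tolerance is exactly one tick, and the desired concept ("positive iff the value is below 5.0") is a hypothetical stand-in chosen to match the 4.9/5.0 scenario. The names `blur`, `explained_by`, and `target` are not from the paper.

```python
from itertools import product

def blur(instance, tolerance):
    """All instance descriptions within `tolerance` ticks on every feature."""
    axes = [range(v - tolerance, v + tolerance + 1) for v in instance]
    return list(product(*axes))

def explained_by(concept, instance, label, tolerance):
    """True if the concept assigns `label` to the instance or to a neighbor."""
    return any(concept(n) == label for n in blur(instance, tolerance))

def target(instance):
    """Hypothetical desired concept: positive iff the feature is below 5.0."""
    return instance[0] < 50          # 50 ticks = 5.0

observed = (50,)                     # measured as 5.0 and reported positive
raw_ok = target(observed) is True    # does the raw reading get the reported label?
blurred_ok = explained_by(target, observed, True, tolerance=1)
print(raw_ok, blurred_ok)
```

The raw reading disagrees with the hypothetical concept, but the neighboring description at 4.9 carries the reported label, which is the situation the text calls bounded inconsistency.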

3  Approach 

The approach taken to solve this problem of learning from
data with bounded inconsistency is most easily described
through the classic view of concept learning as search
[Simon and Lea, 1974; Mitchell, 1978; 1982], namely that the
goal of learning is to determine some concept definition out
of a space of possible definitions as the desired result of
learning. For consistent data this problem is simply one of
finding a description that correctly classifies all the data.
When data are inconsistent, however, there is no such
definition. The approach taken here is to select concept
definitions that correctly classify, for each instance, either the
instance itself or some other object in the instance language
near it. Each instance is effectively “blurred,”
and the concept definitions to be considered are those that
generate a correct classification for at least one instance in
each “blur.”

Note, however, that in general there will be many
possible definitions that correctly classify some instance in each
blur. The preferred result of learning is the description that
requires the fewest instances to be blurred, that is, those
concept definitions that correctly classify as much of the
data as possible.¹
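The selection criterion just described can be illustrated on a toy problem. The sketch below assumes a one-feature instance language with integer values, threshold concepts of the form "x < t", and a blur radius of 3. All of these, along with the data, are invented for illustration and are not taken from the paper.

```python
def blur(x, radius):
    """Integer-valued neighbors of x within `radius` (x itself included)."""
    return range(x - radius, x + radius + 1)

def blurred_count(threshold, data, radius):
    """Number of instances that must be blurred under the concept
    `x < threshold`, or None if some instance cannot be explained at all."""
    needed = 0
    for x, label in data:
        if (x < threshold) == label:
            continue                  # raw instance already classified correctly
        if any((n < threshold) == label for n in blur(x, radius)):
            needed += 1               # a nearby description has the right label
        else:
            return None               # this instance rules the concept out
    return needed

# Invented data (value, is_positive); no threshold classifies all of it.
data = [(30, True), (45, True), (46, True), (44, False), (60, False)]

admissible = {}
for t in range(40, 56):
    count = blurred_count(t, data, radius=3)
    if count is not None:
        admissible[t] = count

best = min(admissible, key=admissible.get)
print(admissible, best)
```

Several thresholds survive the "correct on some instance in each blur" test, and the preferred one is the threshold that leaves the most raw instances correctly classified.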

4  Implementation 

The method used to implement this approach is based on
version spaces [Mitchell, 1978]. However, version spaces
assume the data to be noise-free, and therefore some
modification of the method is necessary.² The approach taken here
is to use incremental version-space merging [Hirsh, 1989b;
1990], a generalization of version spaces [Mitchell, 1978]
that removes its assumption of strict consistency with data.
A version space is generalized to be any set of concept
definitions in a concept description language representable by
boundary sets.³ The key observation is that concept
learning can be viewed as the two-step process of specifying
sets of relevant concept definitions and intersecting these
sets. For each piece of information obtained (typically an
instance and its classification), incremental version-space
merging forms the version space containing all concept
definitions that are potentially relevant given the information
(determined as appropriate for the given learning task). The
resulting version space is then intersected with the version
space based on all past data. This intersection takes place
in boundary-set form (using the version-space merging
algorithm [Hirsh, 1989b; 1990]), and yields the boundary-set
representation for a new version space that reflects all the
past data plus the new information.

The general algorithm proceeds as follows:

1. Form the version space for the new piece of
information.

2. Intersect this version space with the version
space generated from past information.

3. Return to the first step for the next piece of
information.

The initial version space contains all concept descriptions in
the language, and is bounded by the S-set that contains the
empty concept that says nothing is an example, and the
G-set that contains the universal concept that says everything
is an example.
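The three steps can be sketched with explicitly enumerated sets of concepts. This is a simplification: the method described here intersects version spaces in boundary-set (S and G) form, whereas the sketch below stores every concept, which is only workable for a tiny language. The toy language (half-open intervals with endpoints 0, 3, 6, 9 on a single feature) and all function names are assumptions made for the illustration, and the toy language has no explicit empty concept.

```python
from itertools import combinations

GRID = range(0, 10, 3)               # hypothetical endpoints: 0, 3, 6, 9

# Toy concept language: half-open intervals [lo, hi) with endpoints on GRID.
ALL_CONCEPTS = frozenset(combinations(GRID, 2))

def covers(concept, x):
    lo, hi = concept
    return lo <= x < hi

def instance_version_space(x, label):
    """Step 1: every concept that classifies this one instance correctly."""
    return frozenset(c for c in ALL_CONCEPTS if covers(c, x) == label)

def merge(instances):
    """Steps 2 and 3: intersect each new version space with the running one."""
    version_space = ALL_CONCEPTS     # initially every concept in the language
    for x, label in instances:
        version_space = version_space & instance_version_space(x, label)
    return version_space

print(sorted(merge([(4, True), (1, False), (7, True)])))
```

With simple consistency as the rule for forming each instance version space, as here, the loop behaves like candidate elimination on this toy language.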

Use of incremental version-space merging requires a
specification of how the individual version spaces should be
formed in the first step for each iteration. For example,
using simple consistency with instances (i.e., forming the
version space of all concept definitions that correctly classify
the current instance) results in an emulation of Mitchell’s
[1978] candidate-elimination algorithm. This and other
examples are presented elsewhere [Hirsh, 1989b; 1989a;
1990], as are further details of the generalized version-space
approach (including a discussion of the computational
complexity of the method). The key idea in this work is to
include in the instance version space concept definitions that
are consistent with either the instance or some other term in
the instance language near the given instance (where
proximity is defined as appropriate for the given learning task).
The version space of concept definitions to be considered
for each instance will then be those concept definitions
consistent with the instance or one of its neighbors within the
region. The net effect is that all instances are “blurred,” and
version spaces reflect all instances within the blur.

¹This is only one possible criterion for selecting a concept.
Other criteria might consider the amount of “blurring” required
to get an instance that matches the concept, and select the concept
that minimizes the sum (or some other function) of the “blurs.”

²However, Mitchell [1978] describes one possible way to
extend his technique to learn from inconsistent data. This is
discussed in Section 6.

³The boundary sets S and G contain the most specific and
general concept definitions in the set. These bound the set of all
concept definitions in the version space: the version space contains
all concept definitions as or more general than some element in S
and as or more specific than some element in G.

The general technique can be viewed as follows:

Given:

•  Training Data: Positive and negative examples
of the concept to be identified.

•  Definition of Nearby: A method that determines
all instances near a given instance.

•  Concept Description Language: A language
in which the final concept definition must be
expressed.

Determine:

•  A set of concept definitions in the concept
description language consistent with
the data or nearby neighbors of the data.

The method proceeds as follows:

1. (a) Determine the set of instances near a
given instance.

   (b) Form the version space of all concept
definitions consistent with some instance
in this set.

2. Intersect this version space with the version
space generated from all past data.

3. Return to the first step for the next instance.

If an instance is positive the version space of all concept
definitions consistent with some instance in the set of
neighboring instances has as an S-set the set of most specific
concept definitions that cover at least one of the instances, and
as a G-set the universal concept that includes everything. If
the single-representation trick holds (i.e., for each instance
there is a concept definition whose extension only contains
that instance) [Dietterich et al., 1982], the S-set contains the
set of neighboring instances themselves. If an instance is
negative the S-set contains the empty concept that includes
nothing and the G-set contains all minimal specializations
of the universal concept that exclude the instance or one of
its neighbors.
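A small sketch may help show what the blurred instance version space buys. As before, concepts are enumerated explicitly rather than held as S- and G-sets, and the toy language (intervals with endpoints 0, 3, 6, 9), the integer-valued neighbor function `nearby`, and the data are all assumptions made for illustration. With radius 0 the procedure reduces to simple consistency; with radius 1 each instance admits concepts consistent with it or with a neighbor.

```python
from itertools import combinations

# Toy concept language: half-open intervals [lo, hi), endpoints in 0, 3, 6, 9.
ALL_CONCEPTS = frozenset(combinations(range(0, 10, 3), 2))

def covers(concept, x):
    lo, hi = concept
    return lo <= x < hi

def nearby(x, radius):
    """Step 1(a): hypothetical definition of nearby (integer neighbors)."""
    return range(x - radius, x + radius + 1)

def blurred_version_space(x, label, radius):
    """Step 1(b): concepts consistent with the instance or some neighbor."""
    return frozenset(c for c in ALL_CONCEPTS
                     if any(covers(c, n) == label for n in nearby(x, radius)))

def learn(instances, radius):
    """Steps 2 and 3: intersect with the version space from all past data."""
    version_space = ALL_CONCEPTS
    for x, label in instances:
        version_space = version_space & blurred_version_space(x, label, radius)
    return version_space

# Invented data: the positive reading 2 is a slightly errant 3.
data = [(2, True), (1, False), (7, True)]
print(sorted(learn(data, radius=0)))   # strict consistency
print(sorted(learn(data, radius=1)))   # bounded inconsistency
```

On this data strict consistency leaves no concept at all, while the blurred version leaves a single interval, which shows how the definition of nearby controls what survives.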

4.1 Searching the Version Space

Ideally the result of this learning process would be a
singleton version space containing the desired concept
definition. However, if not given enough data the final version
space will have more than one concept definition. This also
happens if the definition of nearby is too generous, since
it will allow too many concept definitions into the version
space, and no set of instances will permit convergence to a
single concept definition. The definition of nearby should
be generous enough to guarantee that the desired concept
definition is never thrown out by any instance, but not so
generous that it includes too many things (or in the worst case,
everything).

In  addition,  often  it  will  take  too  long  to  wait  for  enough 
instances  to  lead  to  convergence  to  a  single  concept  defini¬ 
tion.  As  each  instance  throws  away  candidate  concept  def¬ 
initions,  the  version  space  gets  smaller  and  smaller.  As  the 
version  space  decreases  in  size,  the  probability  that  a  ran¬ 
domly  chosen  instance  will  make  a  difference — will  be  able 
to  remove  candidate  concept  definitions — becomes  smaller 
and  smaller.  The  more  data  processed,  the  longer  the  wait 
for  another  useful  instance.  Therefore  it  will  sometimes  be 
desirable  due  to  time  considerations  to  use  the  small  but 
nonsingleton  version  space  (before  converging  to  a  single 
concept)  to  determine  a  usable  result  for  learning. 

Thus a situation can arise in which the final version space
after processing data has multiple concept definitions. Not
all of the remaining concept definitions are equal, though.
Some may be more consistent with neighboring instances
than with the given instances: for such a concept definition to
be consistent with the ground data, more instances require
“blurring.” An additional technique is therefore used to obtain
a single final concept definition: when a nonsingleton final
version space is generated, all candidate concept definitions in
the version space are checked for their coverage of the raw,
unperturbed data.⁴ The concept definition with best coverage
is then selected as the final generalization. This is
computationally feasible as long as the version space is reasonably
small in size.

5  Example 

To demonstrate this technique Fisher’s iris data [Fisher,
1936] is used. In addition to providing a test of the quality
of the technique, it has been in use for 50 years, and thus
permits a comparison to the results of other techniques.

5.1  Problem 

The particular problem is that of classifying examples of
different kinds of iris flowers into one of three species of
irises: setosa, versicolor, and virginica. The goal is to
learn three nonoverlapping concept definitions that cover
the space of all irises; this requires a slight extension to
the version-space approach, which is described in the next
subsection. There are 150 instances, 50 for each class;
instances are described using four features: sepal width, sepal
length, petal width, and petal length. The units for all four
are centimeters, measured to the nearest millimeter. For
example, one example of setosa had sepal length 4.6cm, sepal
width 3.1cm, petal length 1.5cm, and petal width 0.2cm.

The concept description language was chosen to be
conjunctions of ranges of the form a < x < b for each
feature, where a and b are limited to multiples of 8 millimeters.
An example of a legal concept description has [0.8cm <
petal length < 2.4cm] and [petal width < 1.6cm]. The
range for defining neighboring instances was taken to be
3 millimeters for each feature; that is, as much as 3
millimeters can be added to or subtracted from each feature
value for each instance (defining a range of size 6
millimeters centered on each feature value). There is no restriction
on the number of features that may be blurred: anywhere
from all to none may require blurring.

⁴This, of course, requires retaining all data for this later check
of coverage. An alternative strategy would be to retain only some
subset of the data to lessen space requirements.

Note that although this means that each instance could
be blurred to be any of an infinite number of instances
within the range specified by the nearness metric (or if
values are limited to the nearest millimeter, feature values can
be blurred to any of a large number of nearby values), many
of the instances are equivalent with respect to the concept
description language. Two feature values, although
different, can still fall in the same range imposed by the
concept description language. Thus only a much smaller set of
nearby instances need be considered and enumerated, one
from each grouping of values imposed by the concept
description language.

5.2  Method 

The general approach of Section 4 was used to find rules.
All neighboring instances for each instance are generated
by perturbing the instance in all ways possible: as much as
0.3cm is added to or subtracted from each feature value, and
the concept definitions consistent with each combination of
potential feature values are formed. The union of all these
concept definitions forms the version space for individual
instances. For example, the positive instance of setosa
given earlier (with sepal length 4.6cm, sepal width 3.1cm,
petal length 1.5cm, and petal width 0.2cm) has four
elements in its S-set. All have [2.8cm < sepal width <
3.6cm] and [0.0cm < petal width < 0.8cm]. They
differ, however, on their restrictions on sepal length and
petal length: the different concept definitions correspond
to the four different combinations obtainable by choosing
one of [4.0cm < sepal length < 4.8cm] and [4.8cm <
sepal length < 5.6cm], and one of [0.8cm < petal length <
1.6cm] and [1.6cm < petal length < 2.4cm]. The G-set for
the instance contains the universal concept that includes
everything as positive.
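The count of four S-set elements can be checked with a short sketch. It works in integer millimeters and assumes that ranges are 8 mm wide, start at multiples of 8 mm counted from zero, and are half-open; how values exactly on a boundary are treated is not stated in the text, so that part is an assumption. Sepal width is left out because the range quoted for it (2.8cm to 3.6cm) does not fall on this zero-based grid and the offset used is not stated. The name `candidate_ranges` is hypothetical.

```python
from itertools import product

STEP_MM = 8    # concept ranges end on multiples of 8 mm
BLUR_MM = 3    # each feature may be off by as much as 3 mm

def candidate_ranges(value_mm):
    """Ranges [a, b) of width STEP_MM that contain at least one value
    within BLUR_MM of the measurement (values below zero are ignored)."""
    first = max(value_mm - BLUR_MM, 0) // STEP_MM
    last = (value_mm + BLUR_MM) // STEP_MM
    return [(k * STEP_MM, (k + 1) * STEP_MM) for k in range(first, last + 1)]

# The setosa example from the text, in millimeters (sepal width omitted).
setosa_mm = {"sepal length": 46, "petal length": 15, "petal width": 2}

per_feature = {name: candidate_ranges(v) for name, v in setosa_mm.items()}
s_set = list(product(*per_feature.values()))

for name, ranges in per_feature.items():
    print(name, ranges)
print(len(s_set), "most specific descriptions")
```

Sepal length and petal length each straddle a range boundary once blurred, while petal width does not, which gives the two-by-two combinations described above.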

The goal of learning is to form three disjoint concept
definitions that cover the space of instances, and this requires
two extensions to the technique described above. The first
exploits the fact that the learned concept definitions must
not overlap. The simple approach would be to take the 100
examples of two of the classes as negative examples for the
third class. However, not only must the concept definitions
for the third class exclude these instances, they must
exclude all instances included by the final concept definition
for each of the other two classes. It is not known what the
final definitions will be, but it is known that each must be
more general than some element of the S-set for its class.
That is, whatever the concept definition, it must at least
include all instances covered by some most specific concept
definition generated from the positive data for that class. In
the iris domain the S-set for each class after processing the
positive data for that class was always singleton, so the final
concept definition for each class must include all examples
included by the final S-set element. Therefore, rather than
taking the 50 examples of each class as negative data for
the other two classes, first only positive data are processed
for each class, and the resulting generalization in the S-set
is taken as a generalized negative instance for the other two
classes, replacing the use of the 100 negative instances with
two generalizations that include the negative instances plus
additional instances that must also be excluded. Thus, for
example, all positive data are processed for the setosa class,
and the result in the S-set is taken as a single generalized
negative instance for versicolor and virginica. It summarizes
the setosa data, as well as additional instances that will also
be included by the final definition for setosa.

The second extension is that, since the three concept
definitions that are formed must cover the space, many of the
more specific definitions for each class can be removed,
since no combination of definitions in the version spaces for
the other two classes will cover the space of irises. Instead
of applying the technique for searching the version space to
find the best concept definition (Section 4.1), only the
subset of the version space that could lead to class definitions
that cover the space of irises is considered. The search then
takes place in the cross products of the three much smaller
version spaces. Furthermore, selection of a hypothesis in
one space allows using it as a generalized negative instance
for the other version spaces, so not all triples of concept
definitions from the three version spaces need be considered.

5.3 Results

Since there is only a fixed amount of data, the learning
technique was evaluated by dividing the data into 10 sets of
15 instances, five from each of the three classes. Learning
took place by processing nine of the 10 data sets combined
and testing on the tenth data set. This was done for each
possible group of nine data sets, and the resulting
classification rates were averaged across all 10 runs. A typical
final result for a run is a rule set that classifies irises with
[petal length < 2.4cm] as setosa, irises with [petal length >
2.4cm] and [petal width < 1.6cm] as versicolor, and irises
with [petal length > 2.4cm] and [petal width > 1.6cm] as
virginica.
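The evaluation protocol and the typical rule set can be sketched as follows. The data below are placeholders, not Fisher's measurements: only the setosa values come from the example in the text, and the other two prototypes are made up so that the code runs. The helper names are hypothetical, and assigning values exactly on a cut point to the upper range is an assumption, since the text does not say how ties are handled. The accuracy this sketch computes on placeholder data says nothing about the reported results.

```python
def make_folds(instances_by_class, n_folds=10):
    """Deal each class evenly across the folds (50 per class -> 5 per fold)."""
    folds = [[] for _ in range(n_folds)]
    for label, instances in instances_by_class.items():
        for i, inst in enumerate(instances):
            folds[i % n_folds].append((inst, label))
    return folds

def typical_rule(instance):
    """The typical rule set quoted in the text; instance is
    (petal length, petal width) in centimeters."""
    petal_length, petal_width = instance
    if petal_length < 2.4:
        return "setosa"
    if petal_width < 1.6:
        return "versicolor"
    return "virginica"

def cross_validate(folds, learn):
    """Train on all folds but one, test on the held-out fold, average rates."""
    rates = []
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        rule = learn(train)
        correct = sum(rule(inst) == label for inst, label in test)
        rates.append(correct / len(test))
    return sum(rates) / len(rates)

PLACEHOLDER = {
    "setosa": [(1.5, 0.2)] * 50,       # the example instance from the text
    "versicolor": [(4.5, 1.4)] * 50,   # made-up values, not Fisher's data
    "virginica": [(5.5, 2.0)] * 50,    # made-up values, not Fisher's data
}

folds = make_folds(PLACEHOLDER)
print([len(fold) for fold in folds])
# A stand-in learner that ignores its training data and returns the fixed rule.
rate = cross_validate(folds, learn=lambda train: typical_rule)
```

In a real run `learn` would be the version-space procedure applied to the nine training folds; the stand-in here only exercises the bookkeeping.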

The average overall classification rate on the test data
was 97%; that is, on average 97% of the test cases were placed
in the proper class. For the setosa class alone the rate was
100%, as the class is separable from the other two. For
versicolor alone the rate was 93%, and for virginica 94%.
These rates are comparable to those obtained with other
techniques; for example, Dasarathy’s pattern-recognition
approach [Dasarathy, 1980] obtained 95% accuracy (100%
for setosa, 98% for versicolor, and 86% for virginica); Aha
and Kibler’s noise-tolerant nearest-neighbor method
NTgrowth [Aha and Kibler, 1989] also obtained 95%
accuracy (100% for setosa, 94% for versicolor, and 91% for
virginica); and C4 [Quinlan, 1987], Quinlan’s noise-tolerant
version of ID3, obtained 94% accuracy (100% for setosa,
and 91% for versicolor and virginica).⁵ The results are
summarized in Table 1. NTgrowth and Dasarathy’s method are
both instance-based; C4 is the only other learning method
that can be said to use a concept description language. For
example, on one run it generated the decision tree that
defines irises with [petal length < 2.5cm] as setosa, irises with
[petal length > 2.5cm] and [petal width < 1.8cm] as
versicolor, and those remaining ([petal length > 2.5cm] and
[petal width > 1.8cm]) as virginica. Note that, unlike this
work, C4 decides for itself where attribute values should be
divided.

Table 1: Predictive Accuracy of Learning Systems.

⁵These latter two results are due to Aha (personal
communication).


6  Comparison  to  Related  Work 

This paper describes one approach to learning from
inconsistent data by only addressing the subcase of data
with bounded inconsistency. Drastal, Meunier, and Raatz
[Drastal et al., 1989] have proposed a related method that
works in cases where only positive data have bounded
inconsistency. Their approach is to overfit the inconsistent
data, using a learning technique capable of forming
multiple disjuncts, some of which only exist to cover
anomalous instances. After learning, they remove disjuncts that
only cover instances that can be perturbed to fit under one
of the other disjuncts, in effect removing the disjuncts
that only exist to cover the anomalous data. One benefit of
their technique is that it is applied after learning, focusing
on only those instances covered by small disjuncts, whereas
here all instances must be viewed as potentially anomalous.
However, they make the stronger assumption that all such
inconsistent data fall into small disjuncts. They furthermore
only handle positive data.

As mentioned earlier, Mitchell [1978] presented an
alternative approach to learning from inconsistent data with
version spaces. The key idea was to maintain in parallel
version spaces for various subsets of the data. When no
concept definition is consistent with all data, Mitchell’s
approach considers those concept definitions consistent with
all but one instance. As more inconsistency is detected, the
system uses version spaces based on smaller and smaller
subsets of the data, which the system has been maintaining
during learning. The number of boundary sets that need
be maintained by this process is linear in the total number
of instances to be processed (in the worst case). This is
still unacceptably costly. In the iris domain, assuming that
at most 10% of the data should be discounted, this would
have required updating 30 boundary sets for each instance.
Even using the less reasonable assumption that only 4% of
the data need be ignored will result in an order of
magnitude slowdown. Furthermore, Mitchell’s approach requires
knowing the absolute maximum number of incorrectly
classified instances, in contrast to allowing an unlimited number
of errors as done here (replacing it with a bound on the
distance any instance may be from a correct instance).
Finally, the boundary sets for Mitchell’s approach are much
larger than for the noise-free case, since Mitchell modifies
the candidate-elimination algorithm S-set updating method
to ignore negative data, and similarly positive data are
ignored by the modified G-set updating method (this allows
the use of a linear number of boundary sets by keeping the
boundary sets for multiple version spaces in a single
boundary set).

A  significant  distinction  can  be  made  between  this  work 
and  Mitchell’s  approach,  as  well  as  much  other  work  in 
learning  from  inconsistent  data  (e.g.  [Michalski  and  Lar¬ 
son,  1978;  Quinlan,  1986].  These  approaches  form  con¬ 
cept  definitions  that  perform  well  but  not  perfectly  on  the 
data,  viewing  those  instances  not  covered  as  anomalous. 
Instances  are  effectively  thrown  away,  whereas  here  every 
instance  is  viewed  as  providing  useful  information,  and  the 
final  concept  definition  must  be  consistent  with  the  instance 
or  one  of  its  neighbors.  The  approach  presented  here  works 
on data with bounded inconsistency. Other approaches handle
a wider range of inconsistency, but cannot utilize any
instances that are just a small way off from correct; they instead
throw out such instances as unhelpful. Furthermore,
unlike  other  approaches  that  degenerate  as  more  inconsis¬ 
tency  is  imposed  on  the  data,  the  incremental  version-space 
merging  approach  described  here  still  succeeds  even  when 
all  of  the  data  are  subject  to  bounded  inconsistency. 

To  further  demonstrate  this  point  a  series  of  runs  of  the 
learning  method  were  done  on  a  simple,  artificial  domain. 
There  are  three  attributes  that  take  on  real  values  in  the 
range of 0 to 9. The concept description language partitions
each attribute into three regions: greater than or equal to
0 and less than 3, greater than or equal to 3 and less than 6,
and greater than or equal to 6 and less than or equal to 9. A
concept  definition  is  a  conjunction  of  such  ranges  over  the 
various  attributes.  A  single  preselected  concept  definition 
serves  as  the  target  of  learning. 
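
The concept description language just described can be enumerated directly. The sketch below is a minimal illustration, not the paper's implementation. It assumes that a "range" for an attribute may be any contiguous union of the three regions, a reading chosen because it yields the 216 concept definitions reported later in the paper for this language. The names `CUTS`, `attribute_ranges`, `all_concepts`, and `covers` are hypothetical.

```python
from itertools import product

# Region boundaries from the text: [0, 3), [3, 6), [6, 9].
CUTS = [0, 3, 6, 9]

def attribute_ranges():
    """All contiguous unions of the three regions, as (low, high) pairs.

    Assumption: unions of adjacent regions are allowed, giving six ranges.
    """
    return [(CUTS[i], CUTS[j]) for i in range(3) for j in range(i + 1, 4)]

def all_concepts(n_attributes=3):
    """Every conjunction that picks one range per attribute."""
    return list(product(attribute_ranges(), repeat=n_attributes))

def covers(concept, instance):
    """True if each attribute value falls in the concept's range for it.

    Ranges are half-open, except that the top value 9 is included.
    """
    return all(lo <= v < hi or (hi == 9 and v == 9)
               for (lo, hi), v in zip(concept, instance))

concepts = all_concepts()
print(len(attribute_ranges()), len(concepts))
```

With three attributes and six ranges per attribute, the enumeration produces 6 × 6 × 6 = 216 conjunctions.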

Data  were  created  by  randomly  generating  some  value 
for  each  attribute  in  its  legal  range  (0  to  9).  This  instance 
was  then  classified  according  to  the  preselected  target  con¬ 
cept  definition  for  learning.  The  identity  of  each  instance 
is then perturbed by up to one unit: a random number between
-1 and 1 is added to the given value of each attribute.
This  new  instance  is  given  the  classification  of  the  instance 
on  which  it  is  based.  Training  data  generated  in  this  manner 
have  bounded  inconsistency,  since  any  incorrect  instance  is 
never  more  than  1  away  from  a  correct  instance  on  any  at¬ 
tribute. 
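
The generation procedure above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the target concept `TARGET` is made up, each instance is perturbed independently with probability `fraction_perturbed` (the text does not say how the perturbed subset was chosen), and perturbed values are not clipped to the legal range.

```python
import random

# Hypothetical target concept: one (low, high) range per attribute.
TARGET = [(0, 3), (3, 9), (0, 9)]

def in_target(point, target=TARGET):
    """Classify a point; ranges are half-open except at the top value 9."""
    return all(lo <= v < hi or (hi == 9 and v == 9)
               for (lo, hi), v in zip(target, point))

def make_data(n, fraction_perturbed, rng, target=TARGET):
    """Return (observed_point, label, true_point) triples."""
    data = []
    for _ in range(n):
        true_point = [rng.uniform(0, 9) for _ in target]
        # The label always comes from the unperturbed point.
        label = in_target(true_point, target)
        if rng.random() < fraction_perturbed:
            observed = [v + rng.uniform(-1, 1) for v in true_point]
        else:
            observed = list(true_point)
        data.append((observed, label, true_point))
    return data

data = make_data(80, 1.0, random.Random(0))
```

Every observed point lies within 1 unit of a correctly classified point on each attribute, which is the bounded-inconsistency property the text relies on.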

The  test  runs  perturb  different  percentages  of  the  data, 
to  test  the  sensitivity  of  the  approach  to  this  factor.  The 
definition  of  “nearby”  used  by  the  learning  method  defines 
one instance to be near another if the value of each attribute
of the first is within 1 unit of the corresponding value for
the  second  (i.e.,  the  appropriate  definition  of  nearby  was 
selected).  All  attribute  values  for  a  single  instance  may  be 
perturbed.  80  randomly  generated  instances  were  used. 

Table 2 summarizes the results of the test. The amount
of data that was perturbed was allowed to vary from 0% (no
data perturbed; data are consistent) to 100% (all data perturbed).
In all cases learning used a definition of “nearby”
that  added  1  to  and  subtracted  1  from  the  value  of  each  at¬ 
tribute.  The  result  of  the  experiment  was  that,  in  all  cases, 
incremental  version-space  merging  (with  the  additional  step 
of  selecting  the  best  classifier  from  the  version  space)  con¬ 
verged  to  the  target  concept  definition  that  was  used  to 
generate  the  data.  Unlike  most  other  learning  algorithms 
that  degenerate  as  more  noise  is  introduced  to  the  data,  the 
technique was able to correctly learn the desired concept
definition  even  when  all  data  are  perturbed  within  known 
bounds.  One  way  to  interpret  these  results  is  that  the  ap¬ 
proach  described  here  provides  a  way  to  use  the  knowledge 
that  bounded  inconsistency  exists,  which  permits  success¬ 
ful  learning  even  when  all  the  data  are  incorrect  (within  the 
known  bounds). 

7 Discussion

The  general  approach  described  here  is  to  consider  concept 
definitions  consistent  with  the  instance  or  some  neighbor 
of the instance. The technique requires a method for generating
all instances near a given instance, but it does not
constrain a priori the particular definition of “nearby.” For
example,  in  tree-structured  description  languages  one  such 
definition  would  be  that  two  feature  values  are  in  the  same 
subtree:  rhombus  and  square  might  be  close  whereas  rhom¬ 
bus  and  oval  might  not. 
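
A tree-structured definition of nearby might look like the following sketch. The hierarchy in `PARENT` is invented for illustration (the paper names only rhombus, square, and oval), and "same subtree" is taken here to mean "same immediate parent", which is one possible reading.

```python
# Hypothetical feature-value hierarchy: child -> parent.
PARENT = {
    "rhombus": "quadrilateral",
    "square": "quadrilateral",
    "oval": "curved",
    "circle": "curved",
    "quadrilateral": "shape",
    "curved": "shape",
}

def nearby(a, b):
    """Two values are near if they share the same immediate parent."""
    return a in PARENT and PARENT.get(a) == PARENT.get(b)

def neighbors(value):
    """All values near the given one (including the value itself)."""
    return {v for v in PARENT if nearby(value, v)}

print(neighbors("rhombus"))
```

Under this hierarchy, rhombus and square are neighbors while rhombus and oval are not.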

However, the approach described here is extremely sensitive
to both the concept description language and the definition
of “nearby.” For a fixed language, if the notion of
nearby is too small, the version space will collapse with
no consistent concept; if it is too large, each instance will
have many neighbors, and instance boundary sets will be
quite large, which makes the approach computationally infeasible.
Furthermore, the final version space will likely
be too big to search for a best concept after processing the
data. Similarly, for a fixed definition of nearby, if the concept
description language is too coarse, instances will have
no neighbors, whereas if the language is too fine, then instances
will have too many neighbors. The choice of language
and definition of nearby affects the size of version
spaces and the convergence rate for learning (how many
instances are required to converge, if it is even possible).

The  ideal  situation  for  this  approach  would  be  when  the 
definition  of  nearby  is  given  or  otherwise  known  for  the  par¬ 
ticular  domain,  as  well  as  when  the  desired  language  for 
concepts is provided. However, it is often the case that one
or  both  are  not  known,  as  was  the  case  for  the  iris  domain 
of  the  previous  section,  which  required  a  few  iterations  be¬ 
fore  a  successful  concept  description  language  and  defini¬ 
tion  of  nearby  was  found.  The  first  description  language 
chosen  used  intervals  of  size  4  millimeters,  rather  than  8 
as  was  finally  selected.  The  first  definition  of  nearby  con¬ 
sidered  instances  with  feature  values  within  2  millimeters 
of given feature values as “nearby,” but the version space
collapsed (no concept definition remained after processing
the data) for the versicolor and virginica classes. However,
a definition of nearby of 3 millimeters resulted in too many
neighbors  for  each  instance,  making  the  process  too  com¬ 
putationally  expensive  (the  program  ran  out  of  memory). 
Since  measurements  were  only  given  to  the  nearest  mil¬ 
limeter,  there  was  no  intermediate  definition  of  nearby  to 
try.  Since  adjusting  the  definition  of  nearby  failed  to  work, 
the  next  step  was  to  adjust  the  language,  and  8  millimeter  in¬ 
tervals  were  chosen.  Fortunately  the  first  attempt  using  the 
new  language  with  a  definition  of  nearby  of  3  millimeters 
yielded nonempty version spaces of reasonable size. Criteria
for selecting appropriate description languages and definitions
of nearby are an area for future work.

To further demonstrate this issue a number of runs were
made  on  the  artificial  learning  task  of  the  previous  section. 
In  these  experiments  the  concept  description  language  was 
fixed  (using  the  same  language  as  in  the  previous  section) 
and  the  definition  of  nearby  (the  amount  of  inconsistency 
assumed  present  by  the  learning  method)  was  varied.  100% 
of  the  attribute  values  were  perturbed  by  up  to  1  unit.  This 
set  of  experiments  explores  how  the  definition  of  nearby  af¬ 
fects  the  size  of  the  final  version  space,  as  well  as  the  sizes  of 
the boundary sets for each instance version space. The computational
complexity of incremental version-space merging
is sensitive to boundary-set size [Hirsh, 1989b; 1990].

The results of these experiments are summarized in Tables
3 and 4. There are three attributes, and altogether there
are 216 concept definitions in the language. The different
rows  correspond  to  different  definitions  of  nearby — how 
much  is  added  to  or  subtracted  from  each  instance.  This 
was  varied  from  0  to  3 — the  maximum  distance  apart  the 
feature  values  of  two  instances  can  be  so  that  the  instances 
are  still  considered  neighbors  (column  1  in  the  tables).  Note 
that  the  real  amount  of  variance  imposed  on  values  when 
generating  data  was  at  most  1 — no  more  than  1  was  added 
to  or  subtracted  from  the  randomly  generated  value  for  the 
feature. The second column of Table 3 summarizes the size
of  the  final  version  space  after  all  80  instances  have  been 
processed.  Note  that,  while  learning  is  impossible  here  if 
the  data  are  assumed  consistent,  and  convergence  is  possi¬ 
ble  using  the  80  instances  if  the  definition  of  nearby  adds 
or  subtracts  only  1  to  each  value  (the  actual  value  used  in 
generating  the  inconsistent  data),  as  the  value  for  nearby  in¬ 
creases  to  2  and  3  the  final  version-space  size  increases.  The 
third  column  presents  the  average  number  of  neighbors  for 
each  instance.  As  would  be  expected,  this  number  increases 
significantly  (exponentially)  as  the  amount  by  which  values 
may be perturbed increases. The values are close to their
expected values (obtainable by case enumeration) of 1.0,
3.0 (= (13/9)^3), 6.7 (= (17/9)^3), and 12.7 (= (7/3)^3).
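
The expected values quoted above can be reproduced with exact arithmetic. The sketch below assumes one particular model of the "case enumeration": an attribute value v is uniform on [0, 9], and an instance's neighbors are counted as the distinct combinations of regions that the interval [v - d, v + d] touches on each attribute. This model is a reconstruction that matches the quoted fractions, not something the paper spells out, and the function names are hypothetical.

```python
from fractions import Fraction

def expected_regions_per_attribute(d):
    """Expected number of regions [0,3), [3,6), [6,9] touched by
    [v - d, v + d] for v uniform on [0, 9], valid for 0 <= d <= 3."""
    # The region holding v always counts. Each internal boundary (3 and 6)
    # adds one more region when v lies within d of it: probability 2d/9.
    return 1 + 2 * Fraction(2 * d, 9)

def expected_neighbors(d, n_attributes=3):
    """Attributes are independent, so the per-attribute counts multiply."""
    return expected_regions_per_attribute(d) ** n_attributes

for d in range(4):
    e = expected_neighbors(d)
    print(d, e, round(float(e), 1))
```

The per-attribute expectations come out as 1, 13/9, 17/9, and 7/3 for d = 0, 1, 2, 3, and cubing them gives the four values in the text.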

Table 4 summarizes the average size of boundary sets for
each  instance.  Since  positive  instances  always  have  a  sin¬ 
gleton  G-set,  and  similarly  negative  instances  always  have 
a  singleton  S-set,  averages  are  also  given  for  the  S-  and  G- 
sets  when  excluding  these  cases.  As  expected,  as  the  defini¬ 
tion  of  nearby  grows  more  generous,  the  various  quantities 
increase. 

These  results  emphasize  the  need  for  a  sufficiently  gen¬ 
erous,  but  not  overly  generous,  definition  of  nearby.  They 
furthermore  suggest  a  method  for  automating  the  selection 
of  an  appropriate  definition  of  nearby  given  a  fixed  con¬ 
cept  description  language.  The  method  would  begin  by 
assuming  consistency,  then  slowly  increase  the  amount  of 
inconsistency  assumed  to  be  present  in  the  data  (i.e.,  in¬ 
crease  the  generosity  of  the  definition  of  nearby)  until  either 
a  nonempty  version  space  is  generated  or  enough  time  has 
passed  to  believe  that  the  approach  is  not  computationally 
feasible  on  the  given  data  with  the  given  concept  descrip¬ 
tion  language. 
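
The suggested search over definitions of nearby can be sketched as a simple loop. To keep the example short, the version space is computed by brute-force enumeration of a small concept language rather than by incremental version-space merging with boundary sets, so this illustrates only the control strategy. The region-index encoding, the clipping of values to [0, 9], the time budget parameter, and the hand-made data are all assumptions.

```python
import time
from itertools import product

# Per-attribute ranges as inclusive (first_region, last_region) index pairs
# over the regions [0,3), [3,6), [6,9]: six ranges in all.
RANGES = [(a, b) for a in range(3) for b in range(a, 3)]

def region(v):
    """Region index of v after clipping v to the legal range [0, 9]."""
    v = min(max(v, 0.0), 9.0)
    return min(int(v // 3), 2)

def touched(v, d):
    """Inclusive span of region indices within distance d of v."""
    return region(v - d), region(v + d)

def consistent(concept, instance, positive, d):
    pairs = list(zip(concept, [touched(v, d) for v in instance]))
    if positive:
        # Some neighbor is covered: every span meets the concept's range.
        return all(lo <= b and a <= hi for (a, b), (lo, hi) in pairs)
    # Some neighbor is excluded: some span leaves the concept's range.
    return any(lo < a or hi > b for (a, b), (lo, hi) in pairs)

def version_space(data, concepts, d):
    return [c for c in concepts
            if all(consistent(c, x, positive, d) for x, positive in data)]

def select_radius(data, concepts, radii, budget_seconds):
    """Try radii in increasing order; stop at the first nonempty version
    space, or give up once the time budget is spent."""
    start = time.monotonic()
    for d in radii:
        space = version_space(data, concepts, d)
        if space:
            return d, space
        if time.monotonic() - start > budget_seconds:
            break
    return None, []

# Hand-made one-attribute data whose intended target is the range [3, 6).
data = [((2.5,), True), ((4.0,), True), ((5.5,), False),
        ((1.0,), False), ((8.0,), False)]
concepts = list(product(RANGES, repeat=1))
print(select_radius(data, concepts, [0, 1, 2, 3], budget_seconds=5.0))
```

On this data the version space is empty when consistency is assumed (radius 0) and contains only the middle region at radius 1, so the loop stops there.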

8  Formal  Results 

Recent  theoretical  work  on  concept  learning  (e.g.,  [Haus- 
sler,  1988])  has  developed  techniques  for  analyzing  how  the 
quality  of  results  is  influenced  by  the  amount  of  data  used 
in  learning.  This  section  gives  such  an  analysis  for  learning 
from  data  with  bounded  inconsistency. 

To do this a number of definitions are necessary. First, the
function $\mathit{Neighbors}(x)$ gives the set of examples in the instance
description language that are near $x$ (for the learning-task-specific
definition of nearby):

Definition 1:

$\mathit{Neighbors}(x) = \{\, y \mid y \text{ is near } x \,\}$.

Furthermore,  since  data  may  have  bounded  inconsistency, 
it is necessary to redefine what it means for a concept definition
to be consistent with an instance. A concept is consistent
with an example if it correctly classifies the example
or one of its neighbors:

Definition 2: An instance $x$ is said to
be consistent with a concept $C$ (written
$\mathit{Consistent}(x, C)$) if, when $x$ is positive,
$\exists p \in \mathit{Neighbors}(x)$ with $p \in C$, and when
$x$ is negative, $\exists n \in \mathit{Neighbors}(x)$ with
$n \notin C$ (where $p \in C$ means $C$ would have
classified $p$ as positive, and $n \notin C$ means $C$
would have classified $n$ as negative).
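
Definitions 1 and 2 translate almost directly into code. The sketch below assumes a small finite instance space of integers and a distance-based notion of nearby, both chosen only for illustration; the concept is any predicate that returns true for instances it classifies as positive.

```python
def neighbors(x, radius, instance_space):
    """Definition 1: the instances near x (here, within `radius` of x)."""
    return {y for y in instance_space if abs(y - x) <= radius}

def consistent(x, is_positive, concept, radius, instance_space):
    """Definition 2: x is consistent with the concept if the concept
    classifies x or one of its neighbors the way x is labeled."""
    near = neighbors(x, radius, instance_space)
    if is_positive:
        return any(concept(p) for p in near)
    return any(not concept(n) for n in near)

space = range(0, 10)
middle = lambda v: 3 <= v < 6   # hypothetical concept: values in [3, 6)

print(consistent(2, True, middle, 1, space))   # neighbor 3 is in the concept
print(consistent(2, True, middle, 0, space))   # no slack: 2 itself is outside
print(consistent(4, False, middle, 1, space))  # 3, 4, 5 are all inside
```

With radius 0 the definition reduces to ordinary consistency, which is why assuming consistent data corresponds to the smallest definition of nearby.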

The error of a concept definition $h$ with respect to a
desired target concept $C$ is now defined relative to this new
notion of consistency, as the probability that $h$ and $C$ disagree:

Definition 3: $\mathit{Error}(h, C)$ = the probability
that for a randomly chosen instance $x$
classified as a positive or negative example
of $C$, it is not the case that $\mathit{Consistent}(x, h)$.

With these definitions it is possible to carry over Lemma
2.2 from Haussler’s [1988] A.I. Journal paper:

Lemma 1: The probability that some element
of the version space generated from $m$
examples of $C$ has error greater than $\epsilon$ is less
than $|H| e^{-\epsilon m}$, where $|H|$ is the number of
expressions in the concept description language
$H$ used by incremental version-space
merging.

Proof: Assume that some set of hypotheses $h_1, \ldots, h_k$ in
the concept description language $H$ have error greater than
$\epsilon$ with respect to $C$. This means that the probability that
an example of $C$ is consistent with a particular $h_i$ is less
than $(1 - \epsilon)$. The probability that $h_i$ is consistent with $m$
independent examples of $C$ is therefore less than $(1 - \epsilon)^m$.
Finally, the probability that some $h_i \in \{h_1, \ldots, h_k\}$ is consistent
with $m$ instances is bounded by the sum of their individual
probabilities, thus the probability that some $h_i$ with
error greater than $\epsilon$ (with respect to $C$) is consistent with $m$
examples of $C$ is less than $k(1 - \epsilon)^m$. Since $k \le |H|$, and
$(1 - \epsilon)^m \le e^{-\epsilon m}$, the probability of getting some hypothesis
with error greater than $\epsilon$ consistent with $m$ independent
examples of $C$ is less than $|H| e^{-\epsilon m}$. □

A  simple  corollary  of  this  Lemma  says  how  many  exam¬ 
ples  are  necessary  to  guarantee  with  high  probability  that 
all  concept  definitions  in  the  version  space  have  low  error: 

Corollary 1: The probability that all elements
of the version space generated from
at least

$$\frac{\ln|H| + \ln(1/\delta)}{\epsilon}$$

examples of $C$ will have error less than $\epsilon$ is
at least $1 - \delta$.

Proof: Solving $|H| e^{-\epsilon m} \le \delta$ for $m$ gives the desired result. □
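
As a worked instance of the corollary, the bound can be evaluated numerically. The sketch below plugs in the 216-concept language used in the artificial domain earlier in the paper; the choices of error 0.1 and failure probability 0.05 are arbitrary illustrative values, and the function names are hypothetical.

```python
import math

def examples_needed(n_hypotheses, epsilon, delta):
    """Smallest integer m with m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((math.log(n_hypotheses) + math.log(1 / delta)) / epsilon)

def failure_bound(n_hypotheses, epsilon, m):
    """Lemma 1 bound |H| * exp(-epsilon * m) on the chance that some
    hypothesis with error above epsilon survives m examples."""
    return n_hypotheses * math.exp(-epsilon * m)

m = examples_needed(216, epsilon=0.1, delta=0.05)
print(m, failure_bound(216, 0.1, m))
```

For these values the bound asks for 84 examples, and the Lemma 1 bound at m = 84 falls just under 0.05 while at m = 83 it is still above it.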

At  first  these  results  may  appear  somewhat  surprising, 
since  they  give  the  same  guarantees  as  Haussler,  yet  ad¬ 
dress  learning  from  inconsistent  data.  The  reason  for  this 
is  that  these  results  use  a  weaker  definition  of  consistent, 
and  hence  use  a  weaker  notion  of  error,  than  the  traditional 
definition  used  by  Haussler. 

Finally,  note  that  these  results  do  not  only  apply  to 
the  version  space  approach  presented  here.  Any  learning 
method  that  generates  concept  descriptions  “consistent”  (in 
the  sense  of  Definition  2)  with  m  examples  of  some  concept 
$C$ will have these guarantees.

9  Summary 

This  paper  has  described  an  approach  to  learning  from 
inconsistent  data  that  focuses  on  a  subcase  of  the  prob¬ 
lem,  when  data  have  bounded  inconsistency.  The  key 
idea to learning from such data is to find concept definitions
that miss some of the data, but only by a small
amount.  This  approach  was  implemented  using  a  gener¬ 
alization  of  Mitchell’s  version-space  approach  to  concept 
learning  called  incremental  version-space  merging.  The 
central  idea  is  to  allow  version  spaces  to  contain  concept 
definitions  not  consistent  with  the  data,  but  instead  con¬ 
sistent  with  neighboring  data.  This  results  in  considering 
more  concept  definitions  than  the  original  version-space  ap¬ 
proach  would  ordinarily  consider,  decreasing  the  chance 
that  the  desired  concept  definition  is  removed  due  to  an 
anomalous  instance. 

Abstract

Many  learning  systems  implicitly  use  the  fit-and- 
split  learning  method  to  create  a  comprehensive 
hypothesis  from  a  set  of  partial  hypotheses.  At  the 
core  of  the  fit-and-split  method  is  the  assignment  of 
examples  to  partial  hypotheses.  To  date,  however, 
this  core  has  been  neglected.  This  paper  provides 
the  first  definition  and  model  of  the  fit-and-split 
assignment  problem.  Extant  systems  perform 
assignment  nearly  arbitrarily,  implicitly  using,  for 
example, greedy set covering. This paper also
presents  Conceptual  Set  Covering  (CSC),  a  new 
assignment  algorithm.  An  extensive  empirical 
evaluation  over  a  wide  range  of  learning  problems 
suggests  that  CSC  can  improve  any  fit-and-split 
learning  system. 

1  Introduction 

One  way  to  solve  a  complex  learning  problem  is  to 
decompose  it  into  simpler  learning  problems.  The  fit-and- 
split learning method is perhaps the most direct way to do
such  a  decomposition. 

FIT-AND-SPLIT  LEARNING  METHOD 
Input: Examples and a learning problem called the partial
learner 
Procedure: 

1)  Fitting  —  Use  the  partial  learner  to  find  a  set  of 
partial  hypotheses  such  that  each  example  is  covered 
by  at  least  one  partial  hypothesis. 

2)  Splitting,  part  1  —  Assign  each  example  to  one  of 
the  partial  hypotheses  that  covers  it. 

3)  Splitting,  part  2  — For  each  partial  hypothesis, 
create  a  general  decision  rule  that  covers  the  assigned 
examples. 

Output: A final hypothesis in the form of a decision
list:

if decision rule_1
then apply partial hypothesis_1
else if decision rule_2
then apply partial hypothesis_2
...
else if decision rule_n
then apply partial hypothesis_n


The  fit-and-split  learning  method  is  used  by  many  machine 
learning  systems  of  every  type.  For  example,  it  is  used  by 
operator  learners  [Kadie,  1989;  Shen  and  Simon,  1989],  by 
automatic  programmers  [Summers,  1977],  by  discovery 
systems  [Falkenhainer  and  Michalski,  1986],  and  by  inte¬ 
grated  empirical/explanation-based  systems  [Drastal  et  al., 
1989].  All  these  systems  concentrate  on  step  one  of  the 
method  (partial-hypothesis  creation)  or  step  three 
(decision-rule  creation).  Extant  systems  typically  neglect 
step  two,  example  assignment;  they  do  the  assignment 
without  regard  to  the  effect  on  the  final  hypothesis  using, 
for  example,  the  greedy-set-covering  algorithm. 

This  paper  describes  a  new  assignment  algorithm  called 
Conceptual  Set  Covering  (CSC).  CSC  tries  to  assign 
examples  to  partial  hypotheses  so  as  to  minimize  the  total 
number  of  disjuncts  in  the  decision  rules  of  the  final 
hypothesis.  The  result  is  a  final  hypothesis  that  is  usually 
simpler and more accurate than that produced by alternative
methods. Empirical evidence suggests that the adoption
of CSC can improve any fit-and-split learner.

2  Fit-and-Split  Learning  Problem 

This  section  illustrates  the  fit-and-split  learning  problem 
with  an  example,  defines  the  learning  problem,  and  then 
uses  a  learning-problem  generator  to  specify  a  parameter¬ 
ized  model  of  typical  fit-and-split  learning  problems.  Later 
the  model  will  be  used  to  motivate  a  heuristic  method. 

2.1  Example 

Consider an example from the domain of empirical
discovery.  The  target  function  (unknown  to  the  learner)  is: 

If x > 1 then sin(πx/2)
else if x < -4 then 1
else -x.

In addition, suppose that the learning program is given five
example input/output pairs: (-5, 1), (-2, 2), (-1, 1), (2, 0),
and (5, 1). Figure 1a shows the target function and the
examples. As the learner’s first step, it calls the partial
learner to create a set of partial hypotheses. Suppose the
partial learner creates the functions shown in figure 1b.


1  Support  was  provided  by  the  Fannie  and  John  Hertz  Foundation  and  ONR  grant  N00014-88-K 124. 


Figure  1:  a)  The  Target  Function  and  Example 
Points  —  From  the  example  points  the  learner  tries 
to create an accurate approximation of the function.
b) The partial learner is used to create a set of partial
hypotheses such that each example is covered by at
least one partial hypothesis. Here, each example is
an input/output pair and each partial hypothesis is a
function.  A  partial  hypothesis  covers  an  example  if 
it  includes  the  point  defined  by  the  example. 


None of these partial hypotheses covers all examples, so
they must be combined in a decision list. The decision rules
for the decision list are typically formed by a concept
learner such as AQ, ID3, or PLS [Michalski and Chilausky,
1980; Quinlan, 1986; Rendell, 1986]. Such a concept
learner takes as input a set of positive and negative examples.
It returns a decision rule that covers (that is, returns
true when applied to) each positive example but does not
cover any negative example.

The assignment problem is thus two-fold. First the
assigner must decide where in the decision list each partial
hypothesis will appear.² Second, for each partial hypothesis
occurrence,  the  assigner  must  decide  which  examples  are 
to  be  covered  by  that  occurrence.  The  next  section  shows 
how  the  choices  made  by  the  assigner  can  affect  the 
complexity  (and  thus  the  accuracy)  of  the  final  hypothesis. 

2.2 Fit-and-Split-Assignment Problem

A  geometric  interpretation  of  the  assignment  problem 
will  help  illustrate  how  good  assignments  lead  to  good  final 
hypotheses  and  how  bad  assignments  lead  to  bad  final 
hypotheses. 

A  decision  list,  such  as  the  final  hypothesis,  can  be 
plotted as an ordered set of orthogonal rectangles.³ Each
rectangle  represents  a  disjunct  within  a  decision  rule  of  the 
decision  list.  The  desire  for  a  simple  final  hypothesis 
translates  into  a  preference  for  the  decision  list  plotted  in 
figure  2b  over  the  one  plotted  in  figure  2a,  because  figure  b 
uses  fewer  rectangles  and,  thus,  fewer  disjuncts. 

The fit-and-split problem can be formalized in terms of
the  geometric  interpretation: 

FIT-AND-SPLIT PROBLEM
Given:  A  set  of  examples  and  a  partial  learner 
Find:  A  set  of  covering  partial  hypotheses  and  a  set  of 
rectangles  (disjuncts)  such  that: 

•  Every  rectangle  is  labeled  with  a  partial  hypothesis. 

•  The  set  of  rectangles  is  ordered.  (The  relative  priority 
of  rectangles  with  the  same  label  is  not  important.) 

•  Every  example  must  be  inside  at  least  one  rectangle. 

•  Every  example  must  be  compatible  with  the  first 
rectangle  that  it  lies  within.  An  example  is  said  to  be 
compatible  with  a  rectangle  if  the  example  is  covered 
by  the  rectangle’s  label. 

•  The  number  of  rectangles  is  as  small  as  possible. 


2  A  partial  hypothesis  can  appear  more  than  once  in  the  decision  list. 

3  The  term  rectangles,  as  used  here,  includes  hyper-rectangles  and  open-ended  rectangles  which  extend  to  infinity  along 
one  or  more  axes. 




[Figure 2, panels a) and b): examples plotted along an x axis from -5 to 5, covered by numbered rectangles labeled with partial hypotheses]

Figure  2:  Geometric  Interpretation  —  Examples 
are  represented  by  labeled  points.  Disjuncts  are 
represented  with  rectangles.  Examples  are  labeled 
with  the  partial  hypotheses  that  cover  them.  Rect¬ 
angles  are  labeled  with  a  single  partial  hypothesis 
and  a  number.  Low-numbered  rectangles  have 
priority  over  high-numbered  rectangles.  Figure  a 
shows  the  examples  covered  with  five  rectangles 
(and  three  partial  hypotheses).  Figure  b  shows  the 
examples covered with three rectangles (and three
partial  hypotheses).  Because  figure  b  covers  the 
examples  with  fewer  rectangles,  it  is  preferred. 


Table  1:  The  Assignment  Lists  Corresponding  to 
Figure  2. 


a)
Partial Hypothesis        Assigned Examples
1. f(x) = 1               {-5, -1, 5}
2. f(x) = -x              {-2}
3. f(x) = sin(πx/2)       {2}

b)
Partial Hypothesis        Assigned Examples
1. f(x) = -x              {-2, -1}
2. f(x) = sin(πx/2)       {2, 5}
3. f(x) = 1               {-5}

Rather than looking directly for the ordered set of
rectangles, a learner can look for an assignment list. Table
1 shows the assignment lists corresponding to figure 2. An
assignment list is an ordered set of pairs. The first
component of each pair is a partial hypothesis. The second
component is the set of examples that are assigned to that
partial hypothesis. An example, (x, y), is represented by its
x component. Although not illustrated here, a partial
hypothesis may appear in a list more than once (or not at
all). Examples, however, appear exactly once. In addition,
an example must be assigned to a partial hypothesis that
covers it. Rectangles are formed from assignment lists
according to the procedure listed in figure 3.

1. For each element,
<PARTIAL_HYPO_i, x_set_i> in the assignment
table (in order):

1a. Let POS = x_set_i. Let NEG be the union of
x_set_j for j = i+1, ..., n,
where n is the length of the assignment
table.

1b. Use a concept learning program to
create a set of rectangles REC_SET_i
that distinguishes POS from NEG.

2. Output
"If REC_SET_1 then PARTIAL_HYPO_1
else if REC_SET_2 then PARTIAL_HYPO_2
...
else if REC_SET_n then PARTIAL_HYPO_n"

Figure  3:  The  Assignment-List-to-Decision-List 
Algorithm  —  For  each  row  of  the  table,  the  algo¬ 
rithm  produces  a  decision  rule  that  distinguishes  the 
examples  on  the  row  from  the  examples  on 
subsequent  rows. 
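
The procedure of figure 3 can be sketched for the one-dimensional example. In place of a general concept learner such as AQ or ID3, the sketch uses a simple stand-in that is adequate only in one dimension: the fewest intervals separating POS from NEG equals the number of maximal runs of positive points in sorted order. The two assignment lists are those of Table 1, with the sine term read as sin(πx/2); the function names are hypothetical.

```python
def count_runs(pos, neg):
    """Fewest 1-D intervals covering every point in pos and none in neg."""
    labeled = sorted([(x, True) for x in pos] + [(x, False) for x in neg])
    runs, previous = 0, False
    for _, is_pos in labeled:
        if is_pos and not previous:
            runs += 1          # a new run of positives starts here
        previous = is_pos
    return runs

def rectangles_per_row(assignment_list):
    """For each row, POS is the row's examples and NEG is the union of
    the examples on all later rows (steps 1a and 1b of figure 3)."""
    counts = []
    for i, (hypo, x_set) in enumerate(assignment_list):
        neg = set().union(*[xs for _, xs in assignment_list[i + 1:]])
        counts.append((hypo, count_runs(x_set, neg)))
    return counts

list_a = [("f(x) = 1", {-5, -1, 5}),
          ("f(x) = -x", {-2}),
          ("f(x) = sin(pi*x/2)", {2})]
list_b = [("f(x) = -x", {-2, -1}),
          ("f(x) = sin(pi*x/2)", {2, 5}),
          ("f(x) = 1", {-5})]

for name, rows in (("a", list_a), ("b", list_b)):
    counts = rectangles_per_row(rows)
    print(name, counts, sum(n for _, n in counts))
```

List a) needs five intervals in total and list b) needs three, matching the rectangle counts given in the caption of figure 2.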


Now  the  fit-and-split-assignment  problem,  a  subprob¬ 
lem  of  the  fit-and-split  problem,  can  be  defined: 

FIT-AND-SPLIT-ASSIGNMENT  PROBLEM 
Given:  A  set  of  examples  and  a  covering  set  of  partial 
hypotheses. 

Find:  An  assignment  list  such  that  the  decision  list  pro¬ 
duced  by  the  assignment-list-to-decision-list  algorithm 
will  contain  as  few  rectangles  as  possible. 

In the one-dimensional case the problem does not look
difficult. The example space, however, can be n-dimensional
and non-Euclidean. In general the problem is
NP-hard (because any set-covering problem [Garey and
Johnson, 1979] can be reduced to a fit-and-split assignment
problem with an example space consisting of one nominal
dimension).

Because the problem is NP-hard, it is natural to look for
heuristic methods with good performance on typical cases,
but first, typical fit-and-split-assignment problems must be
characterized. This will be done by modeling fit-and-split-assignment
problems with a parameterized problem generator.

2.3 Model of Fit-and-Split-Assignment Problems

The first step toward describing a model of assignment
problems is to introduce some new terms. In the example
problem, three examples are unambiguous, that is, each is
covered by exactly one partial hypothesis, called its true
partial hypothesis. Unambiguous examples constrain
assignment algorithms because each unambiguous example
must be assigned to the example’s true partial hypothesis.
Ambiguous examples are those that are covered by two
or more partial hypotheses: one true partial hypothesis and
one or more coincidental partial hypotheses. A look back
at the target function of the example problem (figure 1a)
shows that example (-1, 1) is covered by its true partial
hypothesis, f(x) = -x, and one coincidental partial hypothesis,
f(x) = 1. Likewise, example (5, 1) is truly covered by
f(x) = sin(πx/2) and coincidentally covered by f(x) = 1. In
general, every example will be assumed to have one true
partial hypothesis, and zero or more coincidental partial
hypotheses. The success of an assignment algorithm will
depend in part on how well it can distinguish true partial
hypotheses from coincidental partial hypotheses.
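
The distinction between unambiguous and ambiguous examples can be checked mechanically on the example problem. The sketch below assumes the sine hypothesis is sin(πx/2), a reading that fits all five example points, and uses a small numerical tolerance because the sine is evaluated in floating point. The names are hypothetical.

```python
import math

PARTIAL_HYPOTHESES = {
    "f(x) = 1": lambda x: 1.0,
    "f(x) = -x": lambda x: -x,
    "f(x) = sin(pi*x/2)": lambda x: math.sin(math.pi * x / 2),
}
EXAMPLES = [(-5, 1), (-2, 2), (-1, 1), (2, 0), (5, 1)]

def covering(example, tolerance=1e-9):
    """Names of the partial hypotheses that pass through the example."""
    x, y = example
    return [name for name, f in PARTIAL_HYPOTHESES.items()
            if abs(f(x) - y) <= tolerance]

for example in EXAMPLES:
    names = covering(example)
    kind = "unambiguous" if len(names) == 1 else "ambiguous"
    print(example, kind, names)
```

Three of the five examples are covered by exactly one hypothesis, while (-1, 1) and (5, 1) are each also covered by f(x) = 1, in agreement with the text.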

With  these  terms  in  mind,  a  typical  fit-and-split- 
assignment  problem  can  be  characterized  by  describing  a 
problem  generator.  The  generator  is  listed  in  figure  4. 

2.4  Related  Learning  Problems 

The fit-and-split-assignment problem is a specialization
of the overlapping-concept-learning problem [Michalski,
1983; Cohen and Feigenbaum, 1982], where a class corresponds
to a partial hypothesis. It differs from standard
overlapping concept learning in two ways. First, examples
are labeled with every class (that is, partial hypothesis) that
covers them. In the standard problem, examples are labeled
with only one covering class at a time. Second, every
example is covered by one true class and zero or more
coincidental classes. In other words, examples are always
correctly labeled with their true class, but because of
noise/coincidence, they may also be labeled with zero or
more coincidental classes. Unlike other noise models,
coincidence is consistent, that is, if an example comes up
again, it will be labeled the same way. In the standard
problem, overlap is caused by examples having two or
more true classes.


1. Start with an example space. In
general, any example space is permissible.
(For the tests of section 4,
the example space is
[1, 10] × [0.0, 10.0] × {red, green, blue} × [1, 20].)

2. Randomly partition the space into
regions. The space is split by repeatedly
dividing the space with a
hyperplane. The hyperplane is defined
with a randomly chosen value on a
randomly chosen axis. The random
choices are made according to uniform
distributions. The number of splits
to be made is specified by the product
of two parameters, number of
disjuncts per true partial hypotheses
and number of true partial hypotheses.

3. Partial hypotheses are randomly
selected from a finite set of possible
true partial hypotheses, PT. The
selection is done according to the
uniform distribution. A parameter
called number of true partial hypotheses
specifies the size of PT.

4. Example points are selected from the
space by randomly selecting a value
on each axis. Values are chosen
according to the uniform distribution.

5a. If the point has been generated
before, the partial hypothesis that
covered it before covers it now.

5b. If the point is new, it is always
covered by the true partial hypothesis
of the region from which it
comes and may be covered by
coincidental partial hypotheses.
Coincidental partial hypotheses come
from PC, a set formed by randomly
adding between 0 and |PT|-1 new
partial hypotheses to the set PT. The
probability that a partial hypothesis
in PC will cover an example is a
parameter called chance of coincidence.

Figure  4:  Assignment  Problem  Generator  —  This 
generator  provides  a  parameterized  model  of  typi¬ 
cal  assignment  learning  problems.  The  parameters 
are  number  of  true  partial  hypotheses,  number  of 
disjuncts  per  true  partial  hypotheses,  and  chance  of 
coincidence. 
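
The generator steps of figure 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the example space is simplified to numeric axes given by `bounds`, partial hypotheses are represented only by integer labels, a region is identified by the side of each hyperplane on which a point falls, and the number of extra coincidental hypotheses (`n_extra`) is a free parameter because the upper bound in step 5b is not legible in the source. All function and parameter names are hypothetical.

```python
import random

def make_generator(bounds, n_true, n_disjuncts, n_extra, chance, seed=0):
    """Return a function that samples (point, labels) pairs.

    bounds      -- list of (low, high) per axis (assumed numeric axes)
    n_true      -- number of true partial hypotheses, |PT|
    n_disjuncts -- number of disjuncts per true partial hypothesis
    n_extra     -- hypotheses added to PT to form PC (assumed parameter)
    chance      -- chance of coincidence, in [0, 1]
    """
    rng = random.Random(seed)

    # Step 2: split the space with n_true * n_disjuncts hyperplanes, each
    # defined by a uniformly chosen value on a uniformly chosen axis.
    planes = []
    for _ in range(n_true * n_disjuncts):
        axis = rng.randrange(len(bounds))
        low, high = bounds[axis]
        planes.append((axis, rng.uniform(low, high)))

    true_set = list(range(n_true))                # PT
    coincidental = list(range(n_true + n_extra))  # PC = PT plus n_extra new ones
    region_hypo = {}  # step 3: region -> its true partial hypothesis
    seen = {}         # step 5a: labels already given to a point

    def sample():
        # Step 4: choose a value uniformly on each axis (rounded so that
        # repeated points can actually occur).
        point = tuple(round(rng.uniform(low, high), 1) for low, high in bounds)
        if point in seen:
            return point, seen[point]
        region = tuple(point[axis] < value for axis, value in planes)
        if region not in region_hypo:
            region_hypo[region] = rng.choice(true_set)
        # Step 5b: the true hypothesis always covers the point; each member
        # of PC covers it with probability `chance`.
        labels = {region_hypo[region]}
        labels.update(h for h in coincidental if rng.random() < chance)
        seen[point] = frozenset(labels)
        return point, seen[point]

    return sample
```

Calling `make_generator([(0.0, 10.0), (1.0, 20.0)], 5, 3, 4, 0.2)` returns a sampler; because labels are memoized per point, a repeated point is always labeled the same way, which is the consistency property that distinguishes coincidence from other noise models.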
3 Conceptual-Set-Covering Algorithm

The  most  straightforward  way  to  try  to  solve  a  fit-and- 
split  assignment  problem  is  with  the  greedy-set-covering 
algorithm  shown  in  figure  5. 


1. Let i = 1.

2. Find BEST_PARTIAL_HYPO, the partial
hypothesis that covers the most examples.

2a. Let PARTIAL_HYPO_i = BEST_PARTIAL_HYPO.
Let x_set_i be the x components of all
the examples that BEST_PARTIAL_HYPO
covers. Remove all the covered examples
from consideration.

2b. Increment i and repeat at step 2
until all examples are removed.

3 .  Return  the  assignment  list 


Figure 5: The Greedy-Set-Covering Algorithm
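
A minimal Python sketch of the greedy loop in figure 5 follows. It is an illustration under assumptions, not the authors' code: partial hypotheses are modeled as predicates over an input/output pair (x, y), ties are broken by dictionary order, and the three hypotheses and five examples are taken from the worked example of figure 7 (the sine hypothesis is read here as sin(πx/2), a formula that is garbled in the source). The names `EXAMPLES`, `HYPOS`, and `greedy_set_cover` are hypothetical.

```python
import math

# Examples (x, y) from the worked example. sin(pi*x/2) is our reading of a
# formula that is garbled in the source.
EXAMPLES = [(-5, 1), (-2, 2), (-1, 1), (2, 0), (5, 1)]
HYPOS = {
    "1": lambda x, y: y == 1,
    "-x": lambda x, y: y == -x,
    "sin(pi*x/2)": lambda x, y: abs(y - math.sin(math.pi * x / 2)) < 1e-9,
}

def greedy_set_cover(examples, hypos):
    """Return an assignment list [(hypothesis name, x_set), ...]."""
    remaining = list(examples)
    assignment = []
    while remaining:
        # Step 2: the hypothesis covering the most remaining examples.
        best = max(hypos, key=lambda name: sum(hypos[name](x, y) for x, y in remaining))
        covered = [(x, y) for x, y in remaining if hypos[best](x, y)]
        if not covered:
            raise ValueError("some examples are covered by no partial hypothesis")
        # Step 2a: record the x components and remove the covered examples.
        assignment.append((best, [x for x, _ in covered]))
        remaining = [e for e in remaining if e not in covered]
    return assignment

print(greedy_set_cover(EXAMPLES, HYPOS))
```

On these five examples the greedy pass hands f(x) = 1 the x values -5, -1, and 5, even though -1 and 5 are also covered by other hypotheses. This is the kind of assignment that produces extra disjuncts in the decision rules.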

The problem with the greedy-set-covering algorithm is
that it does not go far enough. It tries to minimize the
number of decision rules in the final hypothesis, but does
not try to minimize the number of disjuncts within those
decision rules.

When research on the clustering problem faced a similar
situation, conceptual-clustering algorithms were developed.
These are algorithms that look for a solution that can
be expressed simply [Pitt and Reinke, 1988; Stepp and
Michalski, 1989]. In a similar manner, the Conceptual-
Set-Covering algorithm (CSC) tries to find a solution that
can be expressed simply.

The  CSC  is  listed  in  figure  6.  The  algorithm’s  applica¬ 
tion  on  the  example  problem  is  illustrated  in  figure  7. 


1. Use greedy set covering to find a
small, covering set of partial
hypotheses. Attach an empty assignment
set, ∅, to each partial hypothesis.

2. Give each example an ambiguity score
equal to the number of partial
hypotheses that cover it. (Unambiguous
examples will have an ambiguity
score of 1.)

3. Of all the partial hypotheses that
cover an example with the best (lowest)
ambiguity score, pick one that
covers the most examples. Call this
BEST_PARTIAL_HYPO.

4. Every example is an input/output
pair, (x,y). Create a list of all
examples that 1) are covered by
BEST_PARTIAL_HYPO and 2) have the
best ambiguity score. Let POS be the
x components of this set. Let NEG be
a list of the x components of all
examples not covered by
BEST_PARTIAL_HYPO.

5. Use a concept learner to create a
preliminary decision rule that
distinguishes the examples in POS from
those in NEG.

6. Add the x component of examples
covered by this preliminary decision
rule to BEST_PARTIAL_HYPO's assignment
set. Remove these examples from
consideration. Note that
BEST_PARTIAL_HYPO is not removed from
consideration because a partial
hypothesis can become
BEST_PARTIAL_HYPO more than once.

7. Until no examples remain, repeat at
step 2.

8. Create an assignment list by ordering
the partial hypotheses according
to the size of their assignment sets
(that is, a partial hypothesis to
which two examples have been assigned
is listed before a partial hypothesis
to which one example has been
assigned). Return this assignment
list.


Figure  6:  The  Conceptual-Set-Covering  Algorithm 
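
The loop of figure 6 can be sketched in Python as below. This is a sketch under stated assumptions, not the published implementation: the covering set of partial hypotheses is taken as given (step 1 is assumed already done), the concept learner is replaced by a hypothetical one-dimensional interval learner (`learn_interval`) rather than PLS, and step 6 assigns only examples that the chosen hypothesis actually covers (an added guard). The sine hypothesis is read as sin(πx/2), which is garbled in the source. All names are illustrative.

```python
import math

EXAMPLES = [(-5, 1), (-2, 2), (-1, 1), (2, 0), (5, 1)]
HYPOS = {
    "1": lambda x, y: y == 1,
    "-x": lambda x, y: y == -x,
    "sin(pi*x/2)": lambda x, y: abs(y - math.sin(math.pi * x / 2)) < 1e-9,
}

def learn_interval(pos, neg):
    """Stand-in concept learner: an open interval around POS that stops
    halfway to the nearest NEG value on each side."""
    below = [n for n in neg if n < min(pos)]
    above = [n for n in neg if n > max(pos)]
    lo = (max(below) + min(pos)) / 2 if below else -math.inf
    hi = (min(above) + max(pos)) / 2 if above else math.inf
    return lambda x: lo < x < hi

def csc(examples, hypos):
    remaining = list(examples)
    assigned = {name: [] for name in hypos}  # step 1: empty assignment sets
    while remaining:                         # step 7
        # Step 2: ambiguity score = number of hypotheses covering the example.
        score = {e: sum(h(*e) for h in hypos.values()) for e in remaining}
        best_score = min(score.values())
        # Step 3: among hypotheses covering a least-ambiguous example,
        # pick one that covers the most remaining examples.
        candidates = [n for n, h in hypos.items()
                      if any(h(*e) and score[e] == best_score for e in remaining)]
        best = max(candidates, key=lambda n: sum(hypos[n](*e) for e in remaining))
        # Step 4: POS and NEG are lists of x components.
        pos = [e[0] for e in remaining if hypos[best](*e) and score[e] == best_score]
        neg = [e[0] for e in remaining if not hypos[best](*e)]
        rule = learn_interval(pos, neg)      # step 5
        # Step 6 (with the added guard that `best` covers the example).
        taken = [e for e in remaining if rule(e[0]) and hypos[best](*e)]
        assigned[best] += [x for x, _ in taken]
        remaining = [e for e in remaining if e not in taken]
    # Step 8: order by assignment-set size, largest first.
    return sorted(assigned.items(), key=lambda kv: -len(kv[1]))

print(csc(EXAMPLES, HYPOS))
```

On the five examples of the worked problem, this sketch assigns -2 and -1 to f(x) = -x, 2 and 5 to the sine hypothesis, and only -5 to f(x) = 1. The stand-in learner places its first threshold at -3.5 rather than the -3 reported for PLS, which does not change the assignment.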
1. Greedy set covering finds three partial hypotheses,
f(x) = 1, f(x) = -x, and f(x) = sin(πx/2).

2. The ambiguity score of each example is computed.
Examples (-5,1), (-2,2), and (2,0) are covered by
one partial hypothesis each, and so, have ambiguity
score 1.

3. CSC chooses f(x) = 1 as the best partial hypothesis
because it covers an example, (-5,1), with ambiguity
score 1 and because it covers a total of three examples.

4. POS is {-5}. NEG is {-2,2}.

5. PLS-LISP, a concept learner, is given POS = {-5}
and NEG = {-2,2}. It returns preliminary decision
rule x < -3.

6. Example (-5,1) is added to the assignment set of
partial hypothesis f(x) = 1. Because the assignment
set was empty, it is now {-5}. Example (-5,1) is
removed from consideration.

7. The second time around: BEST_PARTIAL_HYPO is
f(x) = -x. POS is {-2}. NEG is {2,5}. PLS-LISP
returns x < 0. So, (-2,2) and (-1,1) are assigned to
f(x) = -x and removed.

The third time around: BEST_PARTIAL_HYPO is
f(x) = sin(πx/2). POS is {2}. NEG is ∅. PLS returns
TRUE. Examples (2,0) and (5,1) are assigned to
f(x) = sin(πx/2).

8. The result is the assignment list shown in table 1b.
The assignment-list-to-decision-list algorithm (figure 3)
is run on this assignment list. The output is:

if x < 1 and x > -3 then -x
else if x > -1 then sin(πx/2)
else 1

Figure 7: The Conceptual-Set-Covering Algorithm
Applied to the Example Problem of Figure 1.


4  Evaluation 

The hypotheses produced by CSC are as simple as or
simpler than those produced by greedy set covering. Two
series of tests were used to see if this increased simplicity
(as measured by the number of disjuncts in the final
hypothesis) would translate into more effective learning (as
measured by accuracy on validation test sets). The first test
series asked whether CSC could perform
significantly better than greedy set covering. Every test in
the first test series started with a randomly generated target
problem. Targets were generated according to the methods
and distributions described in section 2.3. A validation test
set containing 100 examples was generated. Multiple
training sets of size 5, 10, 20, 50, 100, 150, and 250 were
generated. For each size, 29 training sets were generated.
Four learners were tested. They are listed in table 2. The
95% confidence intervals were determined using the
t-distribution. Figure 8 shows the results of one such test.
With training sets of size 100 and larger, CSC performed
significantly better than the other learners.
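
As an illustration of the interval computation, the following Python sketch derives a 95% confidence interval from 29 accuracy measurements using the t-distribution. It is a sketch under assumptions: the accuracy values are hypothetical placeholders (no measurements are given in the text), the function name `ci95` is illustrative, and the critical value 2.048 is the standard two-sided 95% value of Student's t with 28 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 28 degrees of freedom
# (29 training sets per size, so n - 1 = 28).
T_CRIT_DF28 = 2.048

def ci95(accuracies, t_crit=T_CRIT_DF28):
    """Return (low, high) bounds of the confidence interval for the mean."""
    m = mean(accuracies)
    half_width = t_crit * stdev(accuracies) / sqrt(len(accuracies))
    return m - half_width, m + half_width

# Hypothetical accuracies for one learner at one training-set size.
ACCURACIES = [0.70 + 0.01 * (i % 5) for i in range(29)]
print(ci95(ACCURACIES))
```

Two learners can then be called significantly different at a given training-set size when their intervals do not overlap, which is a conservative reading of the comparison described above.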

Table 2: The Four Learners Tested. Each consists of an
assignment algorithm and a concept learner.

Name       Assigner                      Concept Learner
CSC        CSC                           Decision-tree learner with no pruning
                                         [Quinlan, 1986; Rendell, 1986]
Greedy0    greedy set covering           Decision-tree learner with no pruning
Greedy3    greedy set covering           Decision-tree learner with pruning
                                         parameter r„ set to 3.0 [Rendell, 1986]
Most Freq  all examples to the most      Decision-tree learner with no pruning
           covering partial hypothesis


The first test series showed that CSC could perform
better than greedy set covering. The second test series was
designed to find where each algorithm was best. The space
of possible targets was surveyed. The number of true
partial hypotheses was set to 2, 5, or 10. The expected
number of additional coincidental partial hypotheses was
thus 3.5, 17.0, 59.52, respectively. The number of disjuncts
per true partial hypothesis was set to 1, 3, or 5. And the
coincidence probability was set to 0%, 5%, 10%, 20% or
50%. At each point in target space, three targets were
generated. For each target, training sets of size 5, 10, 20,
50, 100, 150, and 250 were generated. Thus, altogether 945
(3 × 3 × 5 × 3 × 7) tests were conducted.⁴
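
The survey grid can be enumerated directly, which also confirms the count of 945. The sketch below uses only the parameter values stated above; the variable names are illustrative.

```python
from itertools import product

N_TRUE = (2, 5, 10)                          # true partial hypotheses
N_DISJUNCTS = (1, 3, 5)                      # disjuncts per true partial hypothesis
COINCIDENCE = (0.0, 0.05, 0.10, 0.20, 0.50)  # coincidence probability
TARGETS_PER_POINT = 3
TRAINING_SIZES = (5, 10, 20, 50, 100, 150, 250)

tests = [
    {"n_true": t, "n_disjuncts": d, "chance": c, "target": k, "train_size": s}
    for t, d, c, k, s in product(
        N_TRUE, N_DISJUNCTS, COINCIDENCE, range(TARGETS_PER_POINT), TRAINING_SIZES
    )
]
print(len(tests))  # 945
```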

Because the complexity of the targets varied widely, the
average learning curve had high variance. Instead of computing
the average, a better statistic can be computed from
the rank ordering of the four learners on each learning
problem (that is, which learner did best, which did second
best, which did third best, and which did worst). The


4 In fact, each test was repeated with 5 different test sets. The results from these 5 subtests were averaged to produce a
single result for each of the 945 tests.
Figure 8: Learning Curves for Four Assignment
Algorithms. The target was randomly generated
with 5 true partial hypotheses, 3 disjuncts per true
partial hypothesis, and a 20% coincidence probability.


statistical significance of the ranking can be evaluated with
the Friedman F_r-test for randomized block design with α
set to 5% [McClave and Dietrich, 1985].⁵

Figure 9 shows that CSC did best, followed by Greedy0,
Greedy3, and Most Freq. Under the assumption that the
tests are representative, the Friedman F_r-test indicates that
the performance difference between CSC and Greedy0 is
significant to at least the 99.5% level.
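
The rank-based statistic can be sketched as follows. This is a minimal illustration of the standard Friedman statistic for a randomized block design, F_r = 12/(bk(k+1)) · ΣR_j² − 3b(k+1), where each of the b learning problems is a block, the k learners are the treatments, and R_j is the rank sum of learner j. It is not the authors' analysis code: the scores are hypothetical, no tie correction is applied, and 7.815 is the standard chi-square critical value for 3 degrees of freedom at α = 5%.

```python
def ranks(scores):
    """Rank one block of scores, 1 = best (highest); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average
        i = j + 1
    return result

def friedman_statistic(blocks):
    """blocks: list of per-problem score lists, one score per learner."""
    b, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for scores in blocks:
        for j, r in enumerate(ranks(scores)):
            rank_sums[j] += r
    return 12.0 / (b * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * b * (k + 1)

CHI2_CRIT_DF3_ALPHA05 = 7.815  # chi-square critical value, 3 degrees of freedom

# Hypothetical accuracies of four learners on four learning problems.
blocks = [[0.9, 0.8, 0.7, 0.6]] * 4
print(friedman_statistic(blocks) > CHI2_CRIT_DF3_ALPHA05)
```

Because only the within-problem rank order enters the statistic, problems of very different difficulty contribute equally, which is the motivation given above for preferring ranks to averaged learning curves.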

The performance of the learners was also analyzed at
each surveyed point in target space. As section 2.3 detailed,
target space is described with three parameters: the number
of true partial hypotheses in the target, the number of
disjuncts per true partial hypothesis, and the coincidence
probability. At no surveyed point in target space did
another learner do significantly better than CSC. At many
points CSC was ranked first and did significantly better
than the second ranked learner (see table 3).
Figure 9: Frequency of Each Ranking for Each
Learner. CSC was ranked first or tied for first on
74% of the tests. The second best learner, Greedy0,
was ranked first or tied for first on about 36% of the
tests.


Table 3: Observed Points in Target Space Where
CSC Did Significantly Better than the Number Two
Rated Learner. At no point in target space did
another learner perform significantly better than
CSC.


#  of  disjuncts  per  true  partial  hypothesis 


#  of  true 
partial 
hypotheses 

1 

3 

5 

2 

10-20% 

5%,20-50% 

5 

5-50% 

5-20% 

5%,50% 

10 

10-20% 

5% 

5-20% 

5  Discussion  and  Conclusion 

This  paper  identifies  the  assignment  problem  as  an 
important  but  overlooked  part  of  fit-and-split  learning.  It 
contributes  both  a  definition  and  a  model.  The  character¬ 
ization  explains  why  poor  assignments  result  in  a  poor  final 
hypothesis. 


5 This statistical test is often used to decide if a new medical treatment is significantly better than an established treatment.
It is nonparametric, that is, it requires no assumption about distribution.

Conceptual Set Covering (CSC) is a new algorithm that
tries to make intelligent fit-and-split assignments. It works
first with the most unambiguous examples (that is, those
examples covered by the fewest partial hypotheses). The
resulting preliminary decision rules often remove ambiguous
examples from consideration.

CSC  was  evaluated  on  hundreds  of  learning  problems. 
The  complexity  of  the  learning  problems  was  varied  sys¬ 
tematically  along  three  axes:  1)  the  number  of  true  partial 
hypotheses,  2)  the  number  of  disjuncts  per  true  partial 
hypothesis,  and  3)  the  coincidence  probability.  At  many 
points  in  problem  space,  CSC  did  significantly  better  than 
alternative  methods.  It  never  did  significantly  worse.  Thus, 
the  empirical  evaluation  suggests  that  (at  least  to  the  extent 
that  the  hundreds  of  learning  problems  were  representa¬ 
tive)  CSC  is  the  better  algorithm. 

Several limitations of the fit-and-split method and the
CSC algorithm should be kept in mind. First, although the
fit-and-split method is a simple way to decompose a
learning problem, it may not always be the best. Sometimes
it might be better to first split and then fit [Tcheng, et al.,
1989]. Also, because the choice of partial hypotheses
(fitting) can affect the quality of the decision rules (splitting),
it might sometimes be best to integrate the fitting and
splitting processes. The CSC algorithm also has
limitations. It is heuristic; its performance is not guaranteed.
Indeed, the empirical evidence indicating that it
performs well assumes that the model produces
representative assignment problems. The model was chosen
for its simplicity and because it seems to fit, at least as
a first approximation, problems such as scientific
discovery. Study of other fit-and-split problems might lead
to different models and different heuristic algorithms.
Also, CSC works only on all-or-nothing partial hypotheses.
It does not work with probabilistic partial hypotheses ("the
probability that this partial hypothesis covers this example
is 0.9"). Probabilistic partial hypotheses are an area of future
research.

Despite these limitations, the fit-and-split method and
CSC offer important benefits. The fit-and-split method is a
straightforward and efficient way to simplify learning
problems. CSC, according to the empirical evaluation,
works significantly better than assignment methods currently
in use. In addition, CSC is widely applicable. Thus,
the use of the CSC assignment algorithm is likely to
improve any fit-and-split learning system.
Abstract

This paper presents a general
incremental learning scheme: a single
generalization algorithm can both
learn a set of rules from a set of
examples, and achieve the refinement
of a previous set of rules. This
approach is based on a redescription
operator called reduction: from a set
of examples and a set of rules, we
derive a new set of examples
describing the behavior of the rule set.
New rules are extracted from these
behavioral examples: those rules can
be seen as meta-rules, as they control
previous rules in order to improve
their predictive accuracy.


1.  Introduction. 

This paper deals with incremental learning from
examples. Two kinds of incremental processes
are distinguished:

First, an algorithm that sequentially handles
the examples is incremental {Lebowitz 87}. This
sequential incrementality is useful when the
amount of data is huge, in order to avoid
exponential explosion.

A second kind of incrementality aims at
gradually refining the computed knowledge:
for instance, learning by discovery gradually
learns numerical laws from examples {Langley
& al. 1984, 1986; Falkenhainer & Michalski 1986}.
In neural networks, the knowledge encoded
within the network coefficients is also gradually
refined and optimized {Rumelhart & al. 1986;
Fogelman & al. 1986}. This iterative incrementality
is useful when the search space is
huge (such as the space of polynomials of several
variables) and intractable by direct ways.

Our aim is to study a general learning scheme,
leading to both sequential and iterative incrementality.
A two-step process is presented:

• In a first step, a set of rules is learnt from a
set of examples by a generalization algorithm.


Michele  Sebag 
The only assumption we make is that this algorithm
is able to produce deliberately suboptimal
solutions (rules are redundant, admit
exceptions, ...). So, the learning problem is given
an approximate solution, and the next step is to
refine this approximate solution.

• The second step performs the transition
toward another learning problem, namely the
refinement of the previous set of rules. This
transition is performed by a redescription
operator called reduction: given a set of rules
and a set of examples describing the learning
domain, we derive a set of examples describing
the behavior of the rule set over the
learning domain.

In this new context, generalization applies
again: a new set of rules is learnt in order to
correct previous rules. Successive applications of
this two-step process (generalization, reduction)
allow more accurate and more complex (because
disjunctive) rules to be discovered; moreover,
this process can sequentially handle subsets of
examples. So, both sequential and iterative
incrementality can be achieved by this learning
process, called multi-layers learning.

We emphasize that reduction applies whatever
the generalization algorithm: it transforms the
refinement of a learning output into another
(boolean) learning problem.

This  paper  is  organized  as  follows : 

In section 2, some incremental learning
processes are briefly reviewed. We discuss some
problems raised by incrementality, such as the
convergence of these iterative techniques, and
the stability of the learning output.

Section 3 introduces the notion of symbolic
approximation; the definition of an approximate
learning output refers to the defects of a
knowledge base, as stated in {Rajamoney & DeJong
1987}. The reduction operator is then given: it
makes it possible to describe the behavior of a rule set
over a learning domain.

Successive applications of generalization-reduction
can be done. But the worth of
this iterative process depends on the ability of
the generalization to produce suboptimal
solutions, in terms of conciseness and exactness.
The end of section 3 presents a generalization
algorithm that fulfills these requirements, and
that we used to test our approach.

Finally, two applications of the reduction are
detailed in section 4: reduction makes it possible to
gradually learn complex knowledge from the whole
set of examples. It also makes it possible to sequentially
learn from subsets of a large learning set.

2  Examples  Of  Incremental  Processes 

This section does not attempt to give an exhaustive
state of the art (see for instance {Kodratoff
1989}). Our purpose is to point out, on some well-known
learning processes, which problems are
solved and which are raised by incrementality.

Sequential incrementality mainly addresses the
problem of exponential explosion: examples are
handled one by one, as direct global learning is
impossible for material reasons.

Iterative incrementality applies when no direct
learning algorithm is known: a good solution is
gradually built without any new external
information.

2.1  Sequential  Incrementality. 

At  each  step  of  a  sequential  learning  process, 
current  knowledge  evolves  by  considering  one 
new  example.  We  focus  on  the  generality  of  this 
current  knowledge,  and  more  precisely,  we 
consider  the  evolution  of  this  generality  during 
the  learning  process. 

2.1.1  Monotonous  Learning. 

If this generality is monotonous (increasing or
decreasing), the process is said to be monotonous.
An instance of monotonous incrementality is
given by the Version Space {Mitchell 1982}. We
briefly recall the main features of this famous
approach. The space of the most specific versions
of the concept to learn (respectively of the most
general versions), denoted by S (resp. G), is
initialized to "nothing" (resp. "everything"). S is
then continuously generalized to cover new
positive examples of this concept while G is
continuously specialized in order to reject new
negative examples. G and S are ensured to be
equal as soon as enough exact positive and
negative examples have been encountered¹.

Note that this monotonous approach does not
allow any revision of the knowledge: as this
process involves no backtrack, no error is
allowed. The process is ensured to converge
but the quality of the result is guaranteed
only if the data are error-free.

Another example of incremental monotonous
process is the incremental acquisition of concepts
by analogy {Vrain & Lu 1988}. Similarly, the generality
of the current knowledge continuously
increases, and the same remarks about convergence
and error apply.


1 The Version Space is adapted to the characterization
of conjunctive concepts; moreover it can be adapted to
disjunctive concepts as shown in {Manago 1988}.


2.1.2 Non Monotonous Learning.

An instance of non monotonous learning process
is the incremental conceptual classification
{Decaestecker 1989}. This approach gradually
builds a tree of concepts; the examples are
considered one by one. Four operators are used:
creation/deletion of a node, splitting of a node
into several nodes, fusion of several nodes into
one. It is clear that these operators are invertible:
hence the learning process may undo what
was done, delete a node previously created, etc.

This process is theoretically reversible: it
should handle noisy domains better than monotonous
processes. On the other hand, it follows
from its reversibility that it is not ensured to
converge: oscillations of the knowledge (here
the tree of concepts) may appear. This problem is
solved via statistical criteria. The operators
(creation, deletion, etc.) are triggered only
when some numerical thresholds are reached;
those thresholds depend on past examples.

The consequence is that, as the process goes on and
thresholds increase, the examples have less and
less effect on the structure of the tree: they
induce only peripheral evolution of the knowledge.
In other words, the learning output may
depend on the order of the learning input. This
can be seen as the inertia of the learning process
{Cornuejols 1989}.

In short, if non monotonous learning is better
adapted to noisy domains, it is not guaranteed to
converge. A way to ensure convergence is to gradually
decrease the influence of examples, that
is, increase the inertia of the learning process.

2.2  Iterative  Incrementality. 

The main concern of this type of incremental
learning is to gradually provide more accurate
knowledge. A learning phase includes two steps:
first, current knowledge is elaborated; second,
a transition step leads to the next learning
phase.

2.2.1 Neural Networks.

In neural networks, the knowledge is encoded by
numerical coefficients {Rumelhart & al. 1986,
Fogelman & al. 1986}. The coefficients are
gradually optimized; examples are taken into
account in a cyclic way². This process fits into
the above scheme: the first step computes new
values for the coefficients from current values
and current example. The transition step just sets
the coefficients to the computed values.

In this case, incrementality is a purely
mathematical method, which ensures the almost
sure convergence of the process.


2 For this reason, neural networks have been
classified as iterative systems: after a while they
handle no new information, strictly speaking.
2.2.2 Learning By Discovery.

Learning by discovery gradually discovers
numerical laws from examples {Langley & al.
1984, 1986; Falkenhainer & Michalski 1986}.
For instance, one discovers the law PV = nRT
from examples of perfect gases described
according to their pressure P, their volume V,
their temperature T and the number of moles n.

Data and knowledge are represented using two
levels: the first level is a set of descriptors (P, V,
n, T); the second level is a set of examples,
which are points belonging to this description
space (here in 4 dimensions).

In the first step, some descriptors are built (PV,
in the example above) because of their
steadiness on some subset of examples; those
new descriptors are functions of the previous
descriptors, and hence are computable on the set
of examples.

In the transition step, the set of descriptors is
updated, and so is the description of the
examples. We now have examples belonging to
the 5-dimensional space described by (P, V, n, T,
PV), and a next learning phase is possible,
leading to PV/T, and finally to PV/nT.

The process is data-driven; the descriptor
PV/nT induces a law, as it takes the constant
value 8.32 = R on the data (if only perfect gases
are described). In this case, incrementality makes it
possible to heuristically explore a huge search space, the
polynomial fractions of several variables and of
any degree.
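
The successive phases can be illustrated with a small Python sketch. It is a simplified illustration under assumptions, not one of the cited discovery systems: the examples are hypothetical noise-free points generated from PV = nRT with R = 8.32, the candidate descriptors are supplied by hand rather than searched for, and "steadiness" is tested as constancy up to a relative tolerance. The names `steady` and `descriptor_values` are illustrative.

```python
R = 8.32  # constant value quoted in the text

def pressure(V, n, T):
    return n * R * T / V

# Hypothetical noise-free examples, each described by (P, V, n, T).
EXAMPLES = [(pressure(V, n, T), V, n, T)
            for n in (1.0, 2.0) for T in (300.0, 400.0) for V in (0.01, 0.02, 0.05)]

def steady(values, tol=1e-9):
    """True when the values are constant up to a relative tolerance."""
    average = sum(values) / len(values)
    return max(values) - min(values) <= tol * abs(average)

def descriptor_values(examples, descriptor, keep=lambda e: True):
    """Evaluate a descriptor on the examples selected by `keep`."""
    return [descriptor(*e) for e in examples if keep(e)]

PV = lambda P, V, n, T: P * V
PV_T = lambda P, V, n, T: P * V / T
PV_nT = lambda P, V, n, T: P * V / (n * T)

# Phase 1: PV is steady on the subset with n = 1 and T = 300, not overall.
print(steady(descriptor_values(EXAMPLES, PV, lambda e: e[2] == 1.0 and e[3] == 300.0)))
print(steady(descriptor_values(EXAMPLES, PV)))
# Phase 2: PV/T is steady on the subset with n = 1.
print(steady(descriptor_values(EXAMPLES, PV_T, lambda e: e[2] == 1.0)))
# Phase 3: PV/nT is steady on every example, which induces the law.
print(steady(descriptor_values(EXAMPLES, PV_nT)))
```

Each phase adds one descriptor that is steady on a larger subset than the previous one, mirroring the transition step in which the description of the examples is extended before the next learning phase.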

2.3  Some  Remarks. 

On sequential learning: we notice that the
processes we have reviewed only handle one
example at a time. For the sake of robustness,
a (small) subset of examples could be considered
in each learning step: the effects of noise would
be diluted.

On iterative learning: finding a solution and
refining this previous solution are achieved by a
single algorithm. This approach relies on the
transition step, which transforms the initial
problem by use of current knowledge. This
method is well-known in the field of arithmetic
calculus; so it is not surprising that it is applied
to learn numerical knowledge. In the symbolic
field, the question is: what could be the
semantics of such a transition? How could examples
be transformed using current knowledge,
in order to refine this current knowledge?

This paper attempts to answer these questions: a
transition mechanism, inspired by learning
by discovery, is adapted to learning from
examples {Michalski 1984; Kodratoff 1988}.


3  Learning  And  Approximation. 

Our aim is to use the same generalization
algorithm, both to find a solution at one
learning phase, and to refine this solution at the
next learning phase.

A solution is to be refined only if it is an
approximate solution: we first discuss what we
mean by approximation (§3.1). The transition we
propose from one learning phase to another
is detailed in §3.2. Conditions of application are
studied in §3.3. We emphasize that the presented
approach could work with any generalization
algorithm satisfying the hypotheses of §3.3. But we
briefly recall the generalization we used to
illustrate the value of our approach (more
details may be found in {Sebag & Schoenauer
1988, 1989}).

3.1  Symbolic  Approximation. 

A solution to a learning-from-examples problem
is provided by a set of discriminant rules
{Quinlan 1986; Michalski 1986}. We suppose
this solution needs to be refined, and call it an
approximate solution. This notion refers to the
defects of a knowledge base, as stated by
{Rajamoney & DeJong 1987}.

For instance, let a very approximate knowledge
base be given by the four rules:
r1: If I walk on a snake, danger
r2: If I see 3 black crows, danger
r3: If I walk on a grass snake, no danger
r4: If I have my mascots, no danger.

This knowledge base has all kinds of defects:

• It is incomplete, as no conclusion is delivered
(no rule is fired) if I see only 2 black crows.

• It is inconsistent, as contradictory conclusions
are delivered if I walk on a cobra and I have my
mascots (r1 concludes to danger and r4 concludes
to no danger).

• It includes errors: no danger if I walk on a
dead snake, in spite of rule r1.

• It is redundant: I may both walk on a snake
and see 3 black crows.

An  approximate  rule  set  is  just  a  set  of 
incomplete,  inconsistent,  erroneous  and/or 
redundant  rules.  Notice  these  defects  are  not 
independent  :  when  the  number  of  rules 
increases,  then  the  incompleteness  of  this  rule 
set  is  likely  to  decrease.  But  unfortunately 
inconsistency  and  errors  are  likely  to  increase  in 
this  case  :  the  chance  for  an  example  to  fire  at 
least  one  rule  increases,  but  the  chance  to  fire 
two  contradictory  rules  increases  too. 
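
These defects can be made concrete with a short Python sketch of the four-rule knowledge base. It is an illustration under assumptions: situations are encoded as dictionaries with hypothetical keys (`walk_on`, `black_crows`, `mascots`), and the function names are illustrative. Errors (the dead snake) cannot be detected from the rules alone, since that requires the true conclusion, so the sketch only diagnoses incompleteness, inconsistency, and redundancy.

```python
# Each rule is (condition on the situation, conclusion).
RULES = {
    "r1": (lambda e: e.get("walk_on") is not None, "danger"),         # any snake
    "r2": (lambda e: e.get("black_crows", 0) == 3, "danger"),
    "r3": (lambda e: e.get("walk_on") == "grass snake", "no danger"),
    "r4": (lambda e: e.get("mascots", False), "no danger"),
}

def fired(example):
    """Return {rule name: conclusion} for every rule the example fires."""
    return {name: concl for name, (cond, concl) in RULES.items() if cond(example)}

def diagnose(example):
    """Classify the behavior of the rule set on one example."""
    fired_rules = fired(example)
    conclusions = set(fired_rules.values())
    if not fired_rules:
        return "incomplete"      # no rule fires
    if len(conclusions) > 1:
        return "inconsistent"    # contradictory conclusions
    if len(fired_rules) > 1:
        return "redundant"       # several rules, same conclusion
    return "ok"

print(diagnose({"black_crows": 2}))                       # incomplete
print(diagnose({"walk_on": "cobra", "mascots": True}))    # inconsistent
print(diagnose({"walk_on": "cobra", "black_crows": 3}))   # redundant
```

Adding more rules to such a base makes the "incomplete" outcome rarer and the "inconsistent" outcome more frequent, which is the dependence between defects described above.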

Many recent works attempt to find a balance
between incompleteness and inconsistency; see
{Quinlan 1986, Michalski & al. 1986, Clark &
Niblett 1987, Ganascia 1987}; for generalization
of rules with exceptions, see {Quinqueton 1983;
Cestnik & al. 1987; Gams 1988}; see also §3.4.

From the incrementality point of view,
inconsistency seems to be a more tractable defect
than incompleteness. As a matter of fact, when
an example fires no rule, no supplementary
information about this example is provided by
previous learning results. On the contrary, when
an example fires contradictory rules, some
effective work has been done, though not yet
sufficient. New information is given about this
example (the list of fired rules), and a next
learning problem can be set: learning to resolve
the inconsistencies of the previous rules.


3.2 Resolution Of Inconsistencies.

3.2.1  Numerical  Resolution. 

A simple way to resolve inconsistencies is
numerical: the final conclusion is given by a majority
vote, where every fired rule votes for its
conclusion. This method can be refined by
weighting the rules. Statistical criteria can be
used in order to optimize the reliability of this
decision process (Lerman 1989, Perron 1989).
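
The vote described above can be sketched in a few lines of Python. This is only an illustration: the function name, the default weight of 1.0 and the choice to return no conclusion on a tie are assumptions, not part of the cited methods.

```python
from collections import defaultdict

def weighted_vote(fired_rules, weights=None):
    """Return the conclusion with the largest total weight among fired rules.

    fired_rules: list of (rule_name, conclusion) pairs for the rules an example fires.
    weights: optional dict rule_name -> weight; a missing weight defaults to 1.0,
             which gives the plain majority vote (assumption of this sketch).
    Returns None when no rule fires (incompleteness) or when the vote is tied.
    """
    if not fired_rules:
        return None
    weights = weights or {}
    totals = defaultdict(float)
    for name, conclusion in fired_rules:
        totals[conclusion] += weights.get(name, 1.0)
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: the vote alone cannot resolve the conflict
    return ranked[0][0]

# The cobra-and-mascots conflict: r1 says danger, r4 says no danger.
fired = [("r1", "danger"), ("r4", "no danger")]
unweighted = weighted_vote(fired)
weighted = weighted_vote(fired, {"r1": 2.0})
```

With equal weights the two fired rules tie, so the unweighted vote leaves the conflict open; a weight favouring r1 settles it.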

Obviously, weights give a partial order on the
rule set. However, this numerical representation
of priorities is not well suited: priority is not
necessarily a transitive property (a rule r1 may
be preponderant over a rule r2, which may be
preponderant over a rule r3, which may in turn
be preponderant over rule r1...). More generally,
priorities of rules heavily depend on the
context: a symbolic treatment should better suit
our purpose.

3.2.2 Symbolic Resolution.

Let  us  see  how  this  can  be  done,  using 
generalization  again. 

Let us go back to our approximate rule set
about snakes and danger, and consider an
example described by:

I walk on a cobra, and it is full moon, and I
have my mascots (if the conclusion is known, it
is likely danger).

The above description may be transformed
according to the rule set. It gives:

• r1 applies (I walk on a snake)

• r3 does not apply (I do not walk on a grass
snake)

• r4 applies (I have my mascots).

So,  from  an  example  of  the  learning  domain, 
(with  known  description  and  conclusion)  an 
example  of  the  behavior  of  the  rule  set  is 
derived : 

(description) r1 applies, r3 does not, r4 does,

(conclusion) danger.

From  such  new  examples,  new  rules  can  (and 
will)  be  learnt,  for  instance  : 

If r1 applies and not r3, then danger (in other
words, if I walk on a snake which is not a grass
snake, danger).

Remark: this redescription mechanism provides
an alternative to asking the experts to resolve
inconsistencies. As the original conclusion of the
example remains unchanged by the redescription,
we now have examples of conflicts
arising among rules, together with the actual
conclusion. So, the resolution of inconsistencies is
learnt from reduced examples by the system,
rather than requested from the experts.


3.2.3 Definition of Reduction.

Let us define the redescription above more
formally. Let Ω denote the current description
space of the learning domain, and B be a set of
rules expressed within Ω. We wish not to depend
on the actual formalisation of Ω (boolean logic,
multi-valued propositional logic, predicate
logic, symbolic objects, ...).

We only assume that, for any rule and any
example expressed within Ω, we are able to know
whether this example fires this rule, i.e. whether
the description of the example satisfies the
premises of the rule (this requirement is
fulfilled if the rule is to be of any use). In other
words, a rule defines a boolean descriptor on Ω:
for any example, the premises of the rule are or
are not satisfied.


Definition 1.

The reduction is defined according to a rule set
B, and denoted $\pi_B$. The reduction is a
redescription operator defined from Ω into the
boolean space of dimension L, where L is the
number of rules in B:

$$\pi_B : \Omega \to \{0,1\}^L, \qquad s \in \Omega \mapsto \pi_B(s) = (r_j(s))_{j=1..L}$$

where $r_j$ is given by:

$$r_j(s) = \begin{cases} 1 & \text{if } s \text{ satisfies the premises of the } j\text{-th rule in } B \\ 0 & \text{otherwise.} \end{cases}$$


This redescription makes it possible to transform
any example set A expressed within Ω. An example
in A is given by its description s, belonging to Ω,
and its conclusion Conc(s).


Definition 2.

From the learning set A and according to the rule
set B, the reduced learning set, denoted $A_B$, is
given by:

$$A_B = \{ (\pi_B(s_i), Conc(s_i)) \mid (s_i, Conc(s_i)) \in A \}$$
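
As an illustration of the two definitions, here is a minimal Python sketch of the reduction and of the reduced learning set. The dictionary encoding of examples, the predicates standing for r1, r3 and r4, and the conclusion attached to r3 are assumptions made for the example; only the redescription "r1 applies, r3 does not, r4 does" comes from the text.

```python
def reduce_description(rules, s):
    """pi_B(s): one 0/1 flag per rule, 1 when s satisfies the rule's premises."""
    return tuple(1 if premise(s) else 0 for premise, _conclusion in rules)

def reduce_learning_set(rules, examples):
    """A_B: each (description, conclusion) becomes (pi_B(description), conclusion)."""
    return [(reduce_description(rules, s), conc) for s, conc in examples]

# Hypothetical encoding of three rules of the snake example.
SNAKES = {"cobra", "grass snake"}
rules = [
    (lambda s: s.get("walk_on") in SNAKES, "danger"),            # r1
    (lambda s: s.get("walk_on") == "grass snake", "no danger"),  # r3 (conclusion assumed)
    (lambda s: s.get("mascots", False), "no danger"),            # r4
]

example = {"walk_on": "cobra", "full_moon": True, "mascots": True}
reduced = reduce_learning_set(rules, [(example, "danger")])
# reduced == [((1, 0, 1), "danger")]: r1 applies, r3 does not, r4 does.
```

The original conclusion is carried over unchanged; only the description is replaced by the pattern of fired rules.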


The  reduced  learning  set  Ab  describes  the 
behavior  of  B  on  the  examples  in  A  ;  it  is 
expressed  in  boolean  logic,  whatever  the  initial 
representation  of  A  and  B.  The  next  move  is 
obviously  to  generalize  the  reduced  learning  set 
Ab  ;  but  it  requires  the  reduced  set  to  be  a 
valuable  learning  set. 

3.3 Conditions Of Application.

A learning set is valuable if it makes it possible
to learn a good set of rules: a learning set is found
to be valuable a posteriori. A priori, one could just
say that a valuable learning set ought to be
sufficiently consistent, and to include enough
information; those characteristics are vague
indeed.

So we have to suppose that the initial learning
set A does fulfill those good properties. The
question now is: does the reduced learning set
still have those properties, or, in other words,
how does the reduction affect the consistency
and the quantity of information of the data?

Notice that the reduction is not necessarily
injective: distinct examples can have the same
redescription through $\pi_B$. So reduction can
create inconsistencies, if examples with distinct
conclusions are given the same image.
Consistency is easy to check: two examples with
distinct conclusions must be discriminated by at
least one rule in B, i.e. there must exist a rule
which covers exactly one of the two examples.
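
The consistency check can be sketched as follows, assuming the reduced learning set is a list of (boolean tuple, conclusion) pairs; the function name and this encoding are illustrative choices.

```python
def reduction_conflicts(reduced_set):
    """Return the reduced descriptions that carry more than one conclusion.

    An empty result means that any two examples with distinct conclusions
    are discriminated by at least one rule of B.
    """
    seen = {}
    for image, conclusion in reduced_set:
        seen.setdefault(image, set()).add(conclusion)
    return {image: concs for image, concs in seen.items() if len(concs) > 1}

# Two examples with distinct conclusions share the image (1, 0): a conflict.
reduced = [((1, 0), "danger"), ((1, 0), "no danger"), ((0, 1), "danger")]
conflicts = reduction_conflicts(reduced)
```

Adding to B a rule that fires on only one of the two conflicting examples would give them distinct images and remove the conflict.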

The number of distinct examples in the reduced
learning set Ab is at most the number of
examples in the initial learning set A, since
distinct examples may share the same image. But the
reduced learning set must still contain enough
information to enable a further
learning phase: the number of examples should
not decrease too much. (This rough criterion
about sufficiency should be refined with respect
to the size of the description space.)

One can show that both conditions are fulfilled if
the rule set is sufficiently redundant: as the
number of rules increases, the reduction becomes
injective; it follows that the number of
examples is preserved and that reduction causes no
inconsistency.

In the following, we suppose that the reduction
$\pi_B$ preserves the consistency and the amount of
information of the learning set; this is ensured
by tuning the redundancy of the rule set B, as
allowed by several heuristics of generalization
(Cestnik & al. 1987, Gams 1988), and by the
generalization below.


3.4  Generalization  And  Heuristics. 

Our approach to incrementality does not
depend on the generalization algorithm,
provided that sufficiently redundant rules can
be learnt. Nevertheless, we briefly recall here
the generalization we used, as the whole
generalization-reduction process actually works
with it, as will be presented in the last section. A
detailed presentation is given in (Sebag &
Schoenauer 1988, 1989).

3.4.1  Generalization  By  Elimination. 

The following algorithm is a star-like
generalization (Michalski 1984), which handles
multi-valued propositional logic. Its main
originality is a logical pruning of
counter-examples, based on the near-miss notion
(Winston 1975, Kodratoff & Loisel 1984).

Let us take an example; a toy learning set
about right and wrong quests is given by:

Ex 1 : My name is Arthur, and my favorite
color is red, and I am good; quest is right.

Ex 2 : My name is Triboulet, and my favorite
color is blue, and I am good; quest is wrong.

Ex 3 : My name is Ganelon, and my favorite
color is yellow, and I am bad; quest is wrong.

The star-like generalization handles the
examples one by one; from one example, one
attempts to find maximally discriminant rules,
rejecting counter-examples.

Let  us  consider  example  1  ;  we  restrict  to  the 
dropping  rule  generalization  for  the  sake  of 
simplicity.  We  want  to  know  which  conditions 
can  be  dropped  or,  as  well,  which  conditions 
cannot  be  dropped.  From  example  2,  it  follows 
that  conditions  My  name  is  Arthur  and  my 
favorite  color  is  red  cannot  be  dropped 
simultaneously  :  the  rule  If  I  am  good  then 
quest  is  right  does  not  reject  example  2. 

Given example 2, example 3 gives no
information: if the condition My name is Arthur
or the condition my favorite color is red is kept,
this ensures the rejection of example 3, whose
name is not Arthur and whose favorite color is
not red. In short, if example 2 is discriminated
by a rule generalizing example 1, then example 3 is
discriminated too: example 3 is useless for the
purpose of discriminant generalization, and it can
be pruned.

More formally, let s be an example of an
example base A. Any counter-example t of s
gives a constraint over the generalization of s:
the descriptors which discriminate t from s
cannot be dropped simultaneously. The constraint
C(s,t) is a subset of integers, given by
C(s,t) = {i / attribute i discriminates s and t}

We say a counter-example t0 is a maximal
near-miss to s in A if the constraint C(s,t0) is minimal
for set inclusion among all C(s,t), t
counter-example of A.

We then search for all minimal subsets of integers
M such that M intersects every constraint C(s,t).
From such a set M, a rule r_{s,M} is defined as
follows: its premises are the conjunction of all
conditions in s regarding attributes in M; its
conclusion is the conclusion of s. We prove that
r_{s,M} is a maximally discriminant generalization
of s: by construction, for any counter-example t
discriminated from s, there exists an element in
C(s,t) which belongs to M; the corresponding
attribute discriminates s and t; this
condition is kept from s to r_{s,M}, hence r_{s,M}
still discriminates t.

The search for subsets M is achieved by a graph
exploration, exponential with respect to the
number of constraints. However, it is enough for
a subset M to intersect all constraints C(s,t) for t
maximally near-miss to s. Generalization by
elimination thus reduces the size of the exponential
exploration by a preliminary (polynomial)
pruning.
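
The constraint construction, the near-miss pruning and the search for subsets M can be sketched on the toy quest examples. This is a brute-force illustration restricted to the dropping-condition case; the tuple encoding and the function names are assumptions, and the enumeration below stands in for the graph exploration mentioned above.

```python
from itertools import combinations

def constraint(s, t):
    """C(s,t): indices of the attributes whose values differ between s and t."""
    return frozenset(i for i, (a, b) in enumerate(zip(s, t)) if a != b)

def prune_to_near_misses(constraints):
    """Keep only the constraints that are minimal for set inclusion."""
    unique = set(constraints)
    return [c for c in unique if not any(other < c for other in unique)]

def minimal_hitting_sets(constraints, n_attributes):
    """All minimal attribute sets M that intersect every constraint."""
    if not constraints:
        return [frozenset()]  # no counter-example: every condition can be dropped
    found = []
    for size in range(1, n_attributes + 1):  # increasing size guarantees minimality
        for combo in combinations(range(n_attributes), size):
            m = frozenset(combo)
            if all(m & c for c in constraints) and not any(f <= m for f in found):
                found.append(m)
    return found

# Attributes: 0 = name, 1 = favorite color, 2 = character.
ex1 = ("Arthur", "red", "good")        # quest is right
ex2 = ("Triboulet", "blue", "good")    # quest is wrong
ex3 = ("Ganelon", "yellow", "bad")     # quest is wrong

all_constraints = [constraint(ex1, ex2), constraint(ex1, ex3)]
pruned = prune_to_near_misses(all_constraints)   # example 3 is pruned
subsets = minimal_hitting_sets(pruned, 3)
```

Each subset M found keeps one condition of example 1 (the name or the color), which yields the rules "if my name is Arthur, quest is right" and "if my favorite color is red, quest is right".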

3.4.2 Heuristics.

Two  problems  are  raised  by  the  above  algorithm. 

First, it finds perfectly discriminant rules;
this is useless and undesirable if the example set
is noisy, as stated in (Clark & Niblett 1987).

Second,  many  maximally  discriminant  rules  are 
found  and  some  selection  is  required. 

Four heuristic parameters are introduced to
overcome these problems. They are set by the
expert, according to the features of the
available data:

• The exception rate ε is related to the noise
of the data. A uniform forgetting rate of ε is
applied when constraints are built and explored
to find discriminant rules. In this way, rules
which admit exceptions (i.e. do not reject all
counter-examples) may be found. One requires the
ratio of exceptions for a rule to be less than ε
(criterion 1). ε belongs to [0, 1].

• The significance threshold σ is related to
the redundancy of the data. One requires a rule
to generalize more than σ examples (criterion
2). σ is an integer.

• The redundancy rate τ is related to the
sufficiency of the data. The idea is that, if data
are sufficient, only the "best" rules should be
kept; one then requires the number of
examples covered by a rule r to be the maximal
number of examples covered by a rule
generalizing the current example s. But this
criterion leads to instability when data are
insufficient. So we require the number of
covered examples to be greater than the product
of τ by this maximal number of examples
covered by a rule generalizing s (criterion 3). τ
belongs to [0, 1].

• Finally, a more technical parameter, the
patience rate F (like Forget), makes it possible to
control the computational effort. When a given number
of failures occur consecutively during the
generalization of an example (a rule fails because of
criterion 1, 2 or 3), the exploration is
sped up by randomly forgetting constraints.

Criteria 1 and 2 are widely used in the
literature; criterion 3 is also well studied.
Criterion 4 is inspired by simulated annealing
(Davis & Steenstrup 1987) (the patience of the
learner stands for the temperature of the
crystal); as far as we know, this is a new feature
in the generalization area.
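
Criteria 1 to 3 can be read as a filter over the candidate rules generalizing the current example. The sketch below is one possible reading: the ratio of exceptions is taken over all the examples a rule fires on, and the comparisons for criteria 1 and 3 are non-strict so that ε = 0 and τ = 1 remain usable. Both choices, and all names, are assumptions of this sketch.

```python
def select_rules(candidates, epsilon, sigma, tau):
    """Filter candidate rules generalizing one example s.

    candidates: list of (rule_name, n_covered, n_exceptions), where n_covered
    counts the examples the rule generalizes and n_exceptions the
    counter-examples it fails to reject.
    """
    if not candidates:
        return []
    best_cover = max(n_covered for _, n_covered, _ in candidates)
    kept = []
    for name, n_covered, n_exceptions in candidates:
        fired = n_covered + n_exceptions
        ratio = n_exceptions / fired if fired else 0.0
        if ratio > epsilon:                 # criterion 1: too many exceptions
            continue
        if n_covered <= sigma:              # criterion 2: not significant enough
            continue
        if n_covered < tau * best_cover:    # criterion 3: too far from the best rule
            continue
        kept.append(name)
    return kept

candidates = [("a", 10, 1), ("b", 4, 0), ("c", 9, 5), ("d", 1, 0)]
strict = select_rules(candidates, epsilon=0.2, sigma=2, tau=0.5)
loose = select_rules(candidates, epsilon=0.2, sigma=2, tau=0.3)
```

Lowering τ admits rule "b" in addition to rule "a", which is how a lower redundancy rate produces a more redundant rule set.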


4.  Multi-Layers  Learning. 

In this section, we define a general incremental
learning scheme which uses two operators,
generalization and reduction (§4.1). This scheme
makes it possible to gradually learn more accurate and
complex (disjunctive) rules (§4.2). It also makes it
possible to handle the examples sequentially (§4.3).
Some comparative results are presented on a
well-studied learning set (Michalski & al. 1986, Clark &
Niblett 1987, Cestnik & al. 1987).

4.1  A  Learning  Layer. 

From a learning set A and a rule set B, the
reduction gives a new learning set Ab. On the other
hand, many generalization algorithms provide a
rule set B from a learning set. Reduction and
generalization may then be seen as operators,
and combined; the two-step process
(generalization of a rule set, reduction according to
this rule set) is called a learning layer.


4.1.1  Refinement  Of  Previous  Rules. 

We  claim  that  the  generalization  on  reduced 
learning  set  Ab  implicitly  achieves  the 
refinement  of  rule  set  B  : 

• First, if a rule in B has a good predictive
accuracy, this information is implicitly available
from the reduced learning set Ab. As a matter
of fact, if a rule has a good predictive accuracy,
it is often fired; the reduced descriptor often
takes the value 1, and in this case, the actual
conclusion is often equal to the rule conclusion.
Hence, there is a correlation in Ab between a
value of this descriptor and a value of the
conclusion. So the rule will be discovered again
by the next generalization. This process is stable, as
good rules in B are kept.

• Second, the same argument ensures that
irrelevant rules are dropped: if a rule is
irrelevant, the associated reduced descriptor is
irrelevant with respect to Ab too. As
generalization is supposed to detect and forget
irrelevant descriptors, this guarantees that the
rules learnt from Ab do not mention previous
irrelevant rules.

• Third, generalization discovers links among
descriptors and conclusion. In the reduced
learning set Ab, examples are described
according to the rules in B they trigger. Hence,
the triggering of rules can be generalized from
Ab: the generalization resolves conflicts arising
among previous rules.

4.1.2 Approximation Requirements.

In multi-layers learning, rules play a double
role: they are provided by a learning phase, as a
solution to the current learning problem; then
they derive descriptors, and at the next phase those
descriptors are inputs to the next learning
problem. The desirable properties of a learning
solution and of learning means are rather different:
one wants to learn concise and general rules; but to
this aim, one needs sufficient, precise and varied
descriptors. From an incremental point of view,
we obviously prefer to enable the further
learning phase rather than to optimize the current
learning output: hence in a learning step, the
generalization must be able to provide
redundant and inconsistent rules.

This is a general strategy of approximation: as
shown in genetic algorithms or simulated
annealing (Holland & al. 1985), non-optimal
solutions must be kept, especially in the first steps,
in order to avoid local optima.

4.2 Iterative Incrementality.

We suppose here that the data can be handled by the
generalization in a reasonable time (which
depends on the available machine, the available
generalization and the patience of the experts...).

4.2.1 Outline.

An iterative learning is performed by
multi-layers learning as follows:

Learning phase 1:

1.1- Generalize a rule set $B_1$ from the initial
learning set A.

1.2- Reduce A according to $B_1$; $A_1 = A_{B_1}$.

Learning phase 2:

2.1- Generalize (see note 3) a rule set $B_2$ from the
reduced learning set $A_1$.

2.2- Reduce $A_1$ according to $B_2$:
$A_2 = (A_1)_{B_2} = A_{B_1.B_2}$

Learning phase 3:

3.1- Generalize $B_3$ from $A_2$, and so on.

The only condition for this scheme to apply is
for the rule set $B_i$ to be sufficiently redundant,
so that the reduction derives a sufficient and
consistent learning set $A_i$ from $A_{i-1}$; as
mentioned in §3.3, the generalization must be
able to provide sufficiently redundant rules.
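
The alternation of generalization and reduction can be sketched as a short loop. The generalizer used here is a deliberately naive stand-in (one rule per attribute value seen in the data), not the generalization by elimination of §3.4; it is only meant to show that the scheme accepts any generalizer producing redundant rules. All names are hypothetical.

```python
def multi_layer_learning(examples, generalize, n_layers):
    """Alternate generalization and reduction; return the rule sets [B1, B2, ...].

    generalize: any function mapping a learning set to a list of
    (premise, conclusion) rules, premise being a predicate over descriptions.
    """
    rule_sets = []
    current = examples
    for _ in range(n_layers):
        rules = generalize(current)
        rule_sets.append(rules)
        # Reduction: redescribe every example by the rules it fires.
        current = [(tuple(int(bool(p(s))) for p, _ in rules), conc)
                   for s, conc in current]
    return rule_sets

def one_condition_rules(examples):
    """Toy stand-in generalizer: one rule per (attribute, value) pair seen,
    concluding the most frequent conclusion among the examples it covers."""
    pairs = sorted({(i, v) for s, _ in examples for i, v in enumerate(s)}, key=repr)
    rules = []
    for i, v in pairs:
        concs = [c for s, c in examples if s[i] == v]
        best = max(sorted(set(concs)), key=concs.count)
        rules.append((lambda s, i=i, v=v: s[i] == v, best))
    return rules

quests = [
    (("Arthur", "red", "good"), "right"),
    (("Triboulet", "blue", "good"), "wrong"),
    (("Ganelon", "yellow", "bad"), "wrong"),
]
rule_sets = multi_layer_learning(quests, one_condition_rules, n_layers=2)
```

The second rule set is learnt from boolean descriptions only, whatever the representation of the initial examples.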

This process leads to a sequence of rule sets,
called multi-layers rules: premises of rules in
$B_{i+1}$ are conjunctions of literals, those literals
being the premises of rules in $B_i$, or negations
of these premises. Rules in $B_{i+1}$ may be
seen as meta-rules with respect to $B_i$, as they
perform some control over $B_i$:

r : If $r_j$ and $r_k$ are fired, and $r_l$ is not, then ...

where r belongs to $B_{i+1}$,

$r_j$, $r_k$ and $r_l$ belong to $B_i$.

Obviously, rules in $B_i$ can always be expressed
with respect to the original descriptors used in
$B_1$; however, rules in $B_i$, i > 1, may be
disjunctive with respect to the original
descriptors, because negations of previous conjunctive
premises are used.

We point out that there is no generality
relation among rules in $B_i$ and rules in $B_{i+1}$,
hence this learning process is not monotonic
(see §2.2.1); premises of rules in $B_{i+1}$ are not
necessarily more general nor more specific than
premises in $B_i$.

The evolution of this process is both expert- and
data-driven. The expert adjusts the parameters at
each generalization step, especially the
redundancy rate (this adjustment is discussed in
the following). The process stops when the
current learning layer gives no improvement
with respect to the predictive accuracy on a test
set. In our experiments,
predictive accuracy can significantly increase
from 1- to 2-layers rules, can still increase from 2-
to 3-layers rules, and then usually decreases as
the number of layers increases.

4.2.2 Validation.

Multi-layers learning (ML2) has been
implemented in the C language on a Unix
workstation. It uses the generalization algorithm
described in §3.4. The results obtained on a
well-studied medical learning set (about prognosis of
breast cancer) are given below. The reference
results are predictive results obtained by
algorithms AQR (Michalski & al. 1986),
Assistant 86 (Cestnik & al. 1987) and CN2
(Clark & Niblett 1987); those results are found in
the latter paper, as well as the results of a
simple bayesian classifier, denoted Bayes.

3. The same generalization algorithm applies as in the
first phase; the reduced learning set is expressed in
boolean logic. Any generalization algorithm should be
able to cope with this minimal representation.


Table 1 : Reference Results.

Results obtained by   Training set (200 ex.)   Test set (86 ex.)
Bayes                 70%                      65%
AQR                   100%                     72%
CN2                   72-76%                   70-71%
Assistant 86          85-92%                   62-68%

The results of ML2 are estimated the same
way. An n-layers rule set is said to be optimal if
its predictive accuracy on the test set is maximal
among n-layers rule sets.

Here are the results obtained for the optimal
1-layer rule set B, for the optimal 2-layers rule sets
(C1, C2) and for the optimal 3-layers rule sets (D1,
D2, D3).

Those rule sets are learnt with the given
parameters of exception rate ε and redundancy
rate τ.

Table 2 : Optimal Results Of Multi-Layers
Learning

Rule set   Train. set   Test set       ε     τ
B          66%          <52%           .7    .3
C2         93%          75%            .2    .4
C1         79%          58%            .5    .3
D3         94%          [illegible]    .3    .7
D2         83%          60%            .4    .7
D1         62%          48%            .7    .6

One notices that the best (n+1)-layers rule set
is based on an n-layers rule set which is not the
best one among n-layers rule sets.

The cost of tuning the generalization parameters is
exponential in the number of layers. But some
exploration of these possibilities leads to the
following (experimental) conclusions: to build
an optimal n-layers rule set, one must admit a
high exception rate and redundancy at the
beginning; generalization must then gradually look
for more concise and exact rules. As in
simulated annealing again, optimality
requirements must be gradually raised to finally give a
good solution.

4.3 Sequential Incrementality.

In this section, we suppose the learning set is
very large. The problem is then to avoid
exponential explosion. Multi-layers learning can
perform sequential incrementality as follows:

- Some subsets $A_1, ..., A_L$ are extracted from the
initial learning set A by the experts. Those
subsets may form a partition, or may overlap.

Learning phase 1:

- Generalization of learning set $A_1$ gives a rule
set $B_1$.

- All subsets $A_i$ are reduced according to $B_1$.
One sets:

$A_i^1 = (A_i)_{B_1}$

Learning phase 2:

- Generalization of reduced learning set $A_2^1$
gives a rule set $B_2$.

More generally, one has at phase j+1:

$A_i^j = (A_i)_{B_1.B_2...B_j}$

$B_{j+1} = Generalization(A_{j+1}^j)$.

This approach to sequential learning makes it possible
to consider a set of new examples at each learning
phase (see §2.3), and the size of these sets is here
controlled by the expert.

The limitation of this process is the following:
if two consecutive subsets are too different, then
rules learnt from the first subset fail to provide
a good description of the second subset. This problem
is encountered whenever consecutive learning
subsets are non-homogeneous; in this case, some
initial descriptors must be added to the reduced
descriptors, in order to preserve the consistency of
the reduced description. This is done by experts
at the moment.


4.4  Conclusion. 

We state a general incremental learning scheme,
called multi-layers learning. The general idea is
to first learn an approximate rule set, smoothing out
the possible defects of the learning set itself
(noise, incompleteness, ...) and then to learn to
refine a previous rule set.

This approach relies on the reduction operator,
which performs the transition from the first to the
second learning problem. A new description of
the learning domain is derived from an
approximate rule set. Within this new description,
initial examples provide examples of the
behavior of the rule set over the learning
domain. From these behavioral examples,
generalization can extract new rules, which
correct defects and inconsistencies of previous
rules. A sequence of rule sets is thus gradually
built.

Only one assumption is made about
generalization: it must be able to provide
redundant rules. Under this requirement, any
generalization can be used together with
reduction: reduction defines boolean learning
problems, which can always be handled by
generalization (boolean logic is a minimal
representation).


Multi-layers learning makes it possible to gradually
learn more accurate rules. Validation on a
well-studied learning set shows a slight predictive
improvement compared to some well-known
non-incremental learning algorithms.

This process also makes it possible to handle
subsets of examples sequentially. Compared to handling
the examples one by one, this sequential
incrementality is more flexible (the size of those
learning subsets is controlled by the expert) and
suited to noisy domains.

The limitation of this approach comes from the
roughness of the reduction. The transition to
boolean logic is a bit abrupt when the initial
data are expressed within predicate logic: many
first-order rules are required to give a sufficient
0th-order description. An extension of reduction,
from boolean to multi-valued propositional
logic (Michalski 1975), is planned.


Abstract

Decision trees that are limited to testing a
single variable at a node are potentially much
larger than trees that allow testing multiple
variables at a node. This limitation reduces
the ability to express concepts succinctly,
which renders many classes of concepts
difficult or impossible to express. This paper
presents the PT2 algorithm, which searches
for a multivariate split at each node. Because
a univariate test is a special case of a
multivariate test, the expressive power of such
decision trees is strictly increased. The
algorithm is incremental, handles ordered and
unordered variables, and estimates missing
values.

1  Introduction 

For inductive learning, decision-tree methods are
attractive for three principal reasons. First, the
methods find trees that generalize well to the unobserved
instances, assuming that the instances are described
in terms of features that are correlated with the target
concept. Second, the methods are efficient, generally
requiring a total amount of computation that is
proportional to the number of observed training instances.
Finally, the resulting decision tree provides a
representation of the concept that appeals to humans because
it renders the classification process self-evident.

This paper presents a new decision-tree algorithm,
named PT2, that is designed to provide a richer space
of possible tests at a node and to provide a uniform
treatment of ordered and unordered variables.
Section 2 lays out the issues that motivated the design of
PT2, which is presented in Section 3. Following this,
Section 4 illustrates the algorithm on three standard
learning tasks. Finally, Section 5 draws conclusions
about the algorithm and identifies new problems that
require further research.

2  Issues  for  Decision-Tree  Induction 

This section discusses the principal issues that
motivated the design of the PT2 algorithm. These issues
arose from attempting to extend the perceptron tree
algorithm (Utgoff, 1988) to handle ordered variables,
but are central to research on decision-tree induction
in general.

2.1  Splitting  Criterion 

A significant shortcoming of decision-tree algorithms
such as ID3 (Quinlan, 1986) is that the space of
legal splits at a node is impoverished. A split (Moret,
1982) is a partition of the instance space that results
from placing a test at a decision node. Each subset of
the partition corresponds uniquely to one outcome of
applying the test to the instance. ID3 and its
descendants only allow testing a single variable (attribute)
and branching on the outcome of that test.

In order to facilitate generalization, one would like
to avoid a large number of branches at a node. This
means that a variable should not have a large number
of possible values. This is typically the case for
non-numeric variables. However, numeric variables, both
continuous and integer-valued, are problematic due to
the unlimited range of possible values. A more
fundamental distinction among variables is whether a
variable's values are ordered or not. Accordingly, an
ordered variable is one whose possible values are totally
ordered. This class includes continuous and
integer-valued variables. An unordered variable is one whose
values are not totally ordered. One standard technique
for handling an ordered variable with a large number
of possible values is to map its values onto a small
set of intervals. Another technique is to partition the
values of a numeric variable into two open-ended
intervals: those values that are greater than a dynamically
determined constant and those that are not (Breiman,
Friedman, Olshen & Stone, 1984; Quinlan, 1987).

In general, one would like to allow a richer space of
possible splits than those afforded by testing just one
variable at a time (Pagallo & Haussler, 1988; Pagallo,
1989). For example, as shown in Figure 1, if the concept
to be learned is the set of points in the half-plane
$\{(x, y) \mid y < 2x + 3\}$, then a decision tree based on
univariate tests must approximate it with a disjunction
of quarter-planes, e.g.
$\{(x, y) \mid (x > -1 \wedge y < 1) \vee (x > 1 \wedge y < 5)\}$.
This example illustrates the well-known problem that a
univariate test can only split a space with a boundary that is
orthogonal to that variable's axis. This limits the set of
regions in the instance space that can be represented
succinctly, which can result in a large tree and poor
generalization to the unobserved instances. The bias imposed by
allowing only univariate tests may be inappropriate for a target
concept.
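
The half-plane example can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names and the evaluation grid are assumptions chosen for illustration. It confirms that the two quarter-planes lie inside the target half-plane but cover only part of it.

```python
def in_half_plane(x, y):
    """Target concept from the text: y < 2x + 3."""
    return y < 2 * x + 3


def in_quarter_planes(x, y):
    """Univariate approximation from the text: two axis-parallel regions."""
    return (x > -1 and y < 1) or (x > 1 and y < 5)


# Evaluate both predicates on a coarse grid (grid bounds are arbitrary).
grid = [(x / 2, y / 2) for x in range(-10, 11) for y in range(-10, 21)]
target = sum(in_half_plane(x, y) for x, y in grid)
covered = sum(in_quarter_planes(x, y) for x, y in grid)
false_pos = [(x, y) for x, y in grid
             if in_quarter_planes(x, y) and not in_half_plane(x, y)]

print(f"{covered} of {target} target grid points covered, "
      f"{len(false_pos)} false positives")
```

Covering more of the half-plane with univariate tests requires adding further quarter-planes, which is the tree growth the text describes.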

2.2  Incremental  Tree  Revision 

For learning tasks in which training instances are provided
serially, one would like to be able to revise the
existing decision tree, if necessary, instead of rebuilding
it from scratch. One does not want a new training
instance to render previous learning obsolete. In
particular, one would like to be able to conduct the
search for a multivariate test incrementally. Several
groups, including Qing-Yun and Fu (1983), Breiman
et al. (1984), Pagallo and Haussler (1989), Clark and
Niblett (1989), and Chan (1989) have devised methods
for searching for multivariate tests, but these methods
are not incremental.

2.3  Sequential  Testing  and  Understandability 

There are two important advantages of decision-tree
classifiers that one does not want to lose by permitting
multivariate splits. First, the tests in a decision tree
are performed sequentially by following the branches of
the tree. Thus, only those variables that are required
to reach a decision are evaluated. On the assumption
that there is some cost in obtaining the value of
a variable, it is desirable to test only those variables
that are needed. Second, a decision tree provides a
clear statement of a sequential decision procedure for
determining the classification of an instance. A small
tree with simple tests is most appealing because a human
can understand it. There is a tradeoff to consider
in allowing multivariate tests: simple tests may result
in large trees that are difficult to understand, yet
multivariate tests may result in small trees with tests that
are difficult to understand.

2.4  Mixing  Variable  Types 

One of the well-known problems with decision-tree
methods is how to handle ordered variables. As discussed
above, such variables are problematic because
they may have many possible values, e.g. a continuous
variable. For a decision tree, one requires variables
with a small number of possible values, so that the
number of possible branches at a node is kept small.
Techniques exist for mapping an ordered variable to
an unordered variable, thereby reducing the number
of possible values for the ordered variable from many
to as few as two. These techniques are special cases
of the more general approach of partitioning an
n-dimensional Euclidean space into regions, defined by
inequalities. For example, a region in x-y space might
be a disc defined by $\{(x, y) \mid x^2 + y^2 < 3\}$. In general,
one can map an n-dimensional point to a region and
then treat containment within the region as a two-valued
unordered variable.


How can one mix ordered and unordered variables
freely? There are two standard approaches. First, as
already discussed, one can map each ordered variable
to an unordered variable, and then find a Boolean
combination that represents the concept. The principal
problem with this approach is that Boolean combinations
of intervals can only define regions via boundaries
that are each orthogonal to one of the coordinate
axes. This renders the class of concepts that require
other boundary orientations difficult to represent.
Alternatively, one can map each variable, ordered or
unordered, to a numeric variable and then find a
numerical combination that represents the concept. This
latter approach allows an accurate and succinct
representation for a large class of concepts. For decision-tree
induction, one would like to be able to consider
boundaries in any orientation and then form regions
by splitting the space in terms of these boundaries.

In order to map an unordered variable to a numeric
variable, one needs to be careful not to impose an order
on the values of the unordered variable. For a
two-valued variable, one can simply assign 1 to one
value and -1 to the other. If the variable has a range
of more than two values, then each variable-value pair
can be mapped to a propositional variable, which is
TRUE if and only if the variable has the particular
value in the instance (Hampson & Volper, 1986). This
avoids imposing any order on the unordered values of
the variable. With this mapping, one can represent
concepts over unordered variables, ordered variables,
or a mix of such variables. Mapping many-valued
variables to two-valued variables has been observed
to result in trees with higher classification accuracy
(Cheng, Fayyad, Irani & Qian, 1988; Mooney, Shavlik,
Towell & Gove, 1989). There are two reasons. First,
the two-valued variables provide a finer-grained
representation, which makes it possible to find a better
decision tree. Cheng et al. point out that this approach
also eliminates the irrelevant values of a variable.
Second, a binary split of the instance space leaves just two
subspaces, placing a larger proportion of the available
training instances in each subspace (Pagallo & Haussler,
1988). Because learning in each subspace is based
on a larger number of instances, the subtree found will
be better determined.
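
The encoding described above can be sketched as follows. This is an illustrative sketch only: the function name, the `domains` argument, and the example variables are assumptions, not part of the original paper. A two-valued variable becomes a single 1/-1 value, and a many-valued variable becomes one 1/-1 propositional variable per value.

```python
def encode_instance(instance, domains):
    """Encode a dict of variable -> value as a flat list of numbers.

    `domains` maps each unordered variable to its list of possible
    values; variables absent from `domains` are assumed to be numeric.
    """
    encoded = []
    for name, value in instance.items():
        values = domains.get(name)
        if values is None:
            # Ordered (numeric) variable: keep the value as is.
            encoded.append(float(value))
        elif len(values) == 2:
            # Two-valued variable: 1 for one value, -1 for the other.
            encoded.append(1.0 if value == values[0] else -1.0)
        else:
            # Many-valued variable: one propositional variable per value,
            # so no order is imposed on the values.
            encoded.extend(1.0 if value == v else -1.0 for v in values)
    return encoded


domains = {"sex": ["M", "F"], "color": ["red", "green", "blue"]}
print(encode_instance({"age": 41, "sex": "F", "color": "green"}, domains))
```

In this sketch the three-valued `color` expands to three encoded variables, exactly one of which is 1 for any instance.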

2.5 Missing Values

For some instances, it may be that not all variable
values are available. In such a case, one would like
to estimate the missing values. This is true for both
training and classification. Quinlan (1989) describes
a variety of approaches for handling missing values of
unordered variables. These include ignoring any instance
with a missing value, filling in the most likely
value, and combining the results of classification using
each possible value according to the probability of
that value.

Another approach for handling a missing value is to
estimate it using the sample mean, which is an unbiased
estimator of the expected value. For a numeric
(ordered) variable, one need only maintain a running
total of the values seen and a running count of the
number of values seen. This approach can also be applied
to missing values of unordered variables. If, as
suggested above in Section 2.4, one has mapped every
nonnumeric variable to a numeric variable, then
this single mechanism also estimates a missing value
for a nonnumeric variable. There are two types of
mappings to consider. First, if the original variable
is two-valued, then it is mapped to a single numeric
variable, with one value corresponding to 1 and the
other to -1. For a missing value, one simply uses
the sample mean. Second, if the original variable is
many-valued, then it is mapped to a set of propositional
(two-valued) variables, with each one treated as
above. For a missing value, one uses the sample mean
for each of the propositional variables. One would like
to use the sample mean so that the location of the instance
in the encoded Euclidean n-space is estimated
as accurately as possible. Note that the sample mean
of a variable at one node can differ from the sample
mean of the same variable at a different node due to
the different sets of observed instances on which each
is based.
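
The running-total bookkeeping can be sketched as below. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, a missing value is represented by `None`, and returning 0.0 before any value has been seen is a choice made here, not something the paper specifies.

```python
class RunningMean:
    """Sample mean kept as a running total and a running count."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        self.total += value
        self.count += 1

    def mean(self):
        # Assumption: fall back to 0.0 when nothing has been observed yet.
        return self.total / self.count if self.count else 0.0


def fill_missing(encoded, means):
    """Update the means with the observed values, then fill each None."""
    for value, m in zip(encoded, means):
        if value is not None:
            m.observe(value)
    return [m.mean() if value is None else value
            for value, m in zip(encoded, means)]


means = [RunningMean() for _ in range(2)]
fill_missing([1.0, -1.0], means)
fill_missing([3.0, 1.0], means)
print(fill_missing([None, 1.0], means))
```

Keeping one such set of means per decision node would reproduce the node-specific estimates mentioned in the text.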

3  The  PT2  Decision  Tree  Algorithm 

This section presents the PT2 algorithm for inducing a
decision tree from a stream of training instances. The
design of the algorithm was motivated by the issues
discussed above. The algorithm maintains the decision
tree as a global data structure, and revises it
incrementally, as necessary, in response to each received
training instance. A legal decision tree is either NIL,
for the empty tree, or an answer node containing a
class name, or a decision node containing one or more
Boolean variables, each of which defines a binary split
of the instance space. One of these variables is designated
as current, and has a branch to a decision tree
for each of its two possible values. The initial tree is
empty.

Table 1 specifies the PT2 algorithm, which is designed
to learn single concepts over two classes. At
each decision node, it searches for a binary split of the
instance space. This search proceeds on two fronts.
First, a possible binary split is represented as a linear
threshold unit (LTU) over the original input variables.
As training instances are observed at an LTU,
its weights are adjusted as necessary in order to move
its hyperplane in an attempt to separate the positive
instances from the negatives. Second, the algorithm
attempts to find an LTU at a node that is based on
a reduced set of the original input variables. To this
end, a set of LTUs is trained at each node in an attempt
to identify those variables that can be removed
without sacrificing classification accuracy at the node.
The rest of this section describes this search in greater
detail and discusses the basis on which one split is
judged better than another.

Table 1: The PT2 Decision-Tree Update Algorithm.

1. If TREE is empty, then set TREE to new answer node
   containing class name of training instance, return.

2. If TREE is an answer node, then

   (a) If class name of instance is same as TREE, then
       return.

   (b) Set TREE to new decision node, initialize current
       LTU to one based on all n original variables,
       set W ← 0, set list of alternate LTUs to empty,
       initialize other bookkeeping variables.

3. Update decision node at root of TREE, i.e.

   (a) Update the active weight vector W of every LTU
       at the node via the absolute error correction
       procedure. Also update the pocket vector P, and
       other LTU variables as necessary. If P of current
       LTU has changed, then discard its subtrees
       if they exist. Use the sample mean for any missing
       value. Whenever P changes, also save the
       sample means that correspond to P.

   (b) If the current LTU is not yet determined (see
       text), then return.

   (c) If some alternate LTU is determined and is at
       least as good (see Table 2) as the current LTU,
       then

       i.   Replace the current LTU with a best such
            alternate LTU.

       ii.  Remove from the current LTU all original
            variables for which all the associated weights
            of the encoded variables are 0.

       iii. Reset the list of alternate LTUs to those
            based on all n - 1 variable combinations of
            those in the new current LTU. For each
            alternate LTU, initialize its W by setting each
            w_i to the corresponding p_i of the new current
            LTU, then update the LTU via the absolute error
            correction procedure. Use sample mean for
            any missing value.

   (d) Descend recursively along branch below P of current
       LTU as per instance, return.

3.1  Training  an  LTU 

A linear threshold unit can be used as a Boolean variable,
specifically as a test at a decision node, because
it is a predicate over its inputs. The LTU maps its
variables to either the positive or negative side of its
hyperplane. All original variables are encoded as necessary,
as described in Section 2.4, to achieve a set of
numeric variables, called the encoded variables. These
are normalized dynamically so that the maximum observed
value of a variable maps to 0.5 and the minimum
observed value maps to -0.5. The variables are
normalized so that each has the same influence during
the error computation for the absolute error correction
rule. The mapping M(a) is a function of the historical
minimum and maximum observed for the variable.
Let

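$a$ = value of the encoded variable
$h_{\min}$ = historical minimum for variable
$h_{\max}$ = historical maximum for variable

Then

$$
M(a) =
\begin{cases}
-0.5 & \text{if } h_{\max} = h_{\min} \\
\dfrac{a - h_{\min}}{h_{\max} - h_{\min}} - 0.5 & \text{otherwise}
\end{cases}
$$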

As training instances are observed, each LTU is adjusted
as necessary via the absolute error correction
procedure (Nilsson, 1965). Combining input features
to form multivariate splits is a form of constructive
induction; new terms are created based on linear
combinations of subsets of the original variables.
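
The absolute error correction step can be sketched as follows. The sketch assumes the usual formulation of the rule, in which a misclassified instance X causes W to be changed by c·X with c the smallest integer that makes the instance correctly classified; the function name, the ±1 label convention, and the trailing constant input used as a threshold are assumptions made here for illustration.

```python
import math


def absolute_error_correction(w, x, label):
    """Return the weight vector after presenting one instance.

    w, x  : equal-length lists; x is assumed to end with a constant 1.0
            so that the last weight acts as the threshold.
    label : +1 for a positive instance, -1 for a negative one.
    """
    activation = sum(wi * xi for wi, xi in zip(w, x))
    if label * activation > 0:
        return list(w)  # already on the correct side: no change
    norm_sq = sum(xi * xi for xi in x)
    # Smallest integer c strictly greater than |W.X| / (X.X).
    c = math.floor(abs(activation) / norm_sq) + 1
    return [wi + label * c * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0, 0.0]
w = absolute_error_correction(w, [0.5, -0.5, 1.0], +1)
w = absolute_error_correction(w, [0.5, 0.5, 1.0], -1)
print(w)
```

After each correction the presented instance is classified correctly, which is the property that distinguishes this rule from a fixed-increment update.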


3.2  Eliminating  Variables 

If, at a node, the designated current LTU based on
n original variables is no better than one of the
alternate LTUs, each based on n - 1 of these variables,
then the set of LTUs being considered at the node is
changed. A best such alternate LTU is designated as
current, the others are discarded, and a new set of
alternate LTUs is created, based on leaving out one
variable from those of the new current LTU. This is
done in order to find a test on fewer variables if possible.
The idea of removing one original variable at a
time is taken from CART (Breiman, Friedman, Olshen
& Stone, 1984), and is based on the assumption that
it is better to search for a useful projection onto fewer
dimensions from a relatively well informed state than
it is to search for a projection onto more dimensions
from a relatively uninformed state. Note that when
one original variable is removed, one or more encoded
variables are removed.


3.3  Comparing  Splits 

How does one decide whether one split, as manifested
by an LTU, is better than another, so that a best LTU
can be designated current? This is accomplished with
a procedure that depends on Gallant's (1986) Pocket
Algorithm. For a linear threshold unit, the algorithm
saves in P the best weight vector W that occurs during
normal perceptron training, as measured by the
longest run of consecutive correct classifications, called
the pocket count, assuming that the observed instances
are chosen in a random order. Gallant shows that the
probability of an LTU based on the pocket vector P
being optimal approaches 1 as training proceeds. The
pocket vector is optimal in the sense that no other
weight vector visited so far is likely to be a more accurate
classifier. The Pocket Algorithm fulfills a critical
role when searching for a separating hyperplane because
the classification accuracy of the LTU based on
W is unpredictable when the instances are not linearly
separable (Duda & Hart, 1973).

Table 2: Procedure to Determine Whether LTU1 is
Better than LTU2.

1. If either of LTU1 or LTU2 is undetermined then
   return FALSE.

2. If the pocket count of LTU1 is higher than that of
   LTU2, then return TRUE.

3. If the pocket count of LTU1 is identical to that of
   LTU2, and LTU1 is based on fewer original variables,
   then return TRUE.

4. Return FALSE.
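
The pocket bookkeeping can be sketched as below. This is an illustrative sketch, not the PT2 implementation: the class name is hypothetical, labels are assumed to be ±1, and for brevity the active weights are updated with the plain fixed-increment perceptron step rather than the absolute error correction procedure that PT2 uses.

```python
class PocketLTU:
    """Linear threshold unit with a pocket vector (sketch)."""

    def __init__(self, n_inputs):
        self.w = [0.0] * (n_inputs + 1)  # active weights W, last is threshold
        self.p = list(self.w)            # pocket weights P
        self.run = 0                     # current run of correct answers by W
        self.pocket_count = 0            # longest run observed so far

    def train(self, x, label):
        xb = list(x) + [1.0]
        activation = sum(wi * xi for wi, xi in zip(self.w, xb))
        if label * activation > 0:
            self.run += 1
            if self.run > self.pocket_count:
                # W has just beaten the best run so far: put it in the pocket.
                self.p = list(self.w)
                self.pocket_count = self.run
        else:
            # Fixed-increment perceptron step (a simplification, see lead-in).
            self.w = [wi + label * xi for wi, xi in zip(self.w, xb)]
            self.run = 0

    def classify(self, x):
        xb = list(x) + [1.0]
        return 1 if sum(pi * xi for pi, xi in zip(self.p, xb)) > 0 else -1


# Toy concept: positive exactly when the first encoded input is 0.5.
data = [((0.5, 0.5), 1), ((0.5, -0.5), 1), ((-0.5, 0.5), -1), ((-0.5, -0.5), -1)]
ltu = PocketLTU(2)
for _ in range(10):
    for x, label in data:
        ltu.train(x, label)
print(ltu.p, ltu.pocket_count)
```

Classification uses P rather than W, so a late disruptive update to W does not degrade the split used by the tree.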

One might simply assume that the LTU with the
highest pocket count provides the best split, but there
are two problems with such an assumption. The procedure
shown in Table 2 was devised to decide whether
one LTU provides a better split than another, and it
depends on the definition of determined given below.
To understand the need for this procedure, consider
the problems that would arise from using the pocket
count alone.

First, one cannot select an LTU if its pocket vector
fails to discriminate among the instances. Consider a
learning problem in which the training instances belong
predominantly to one class. An LTU can classify
a long sequence of instances correctly if it always
classifies each one as the more frequently occurring class.
Indeed, such a split may result in higher classification
accuracy than any split that actually discriminates
instances in one class from the other. However, a split
that does not discriminate is no split at all. If the space
of instances at a node is not split, then the space of
instances at one subtree will be identical to the space at
the parent. Thus, the same null split would be found
at the subtree, the process would repeat, and the tree
would become infinitely deep. To avoid this, an LTU
with a pocket vector that is not based on having observed
instances from more than one class can never
be considered better than another LTU.

The second problem is that if the current LTU is
not a perfect classifier, then subtrees will be needed
to split the space further. It can be wasteful to try
to grow subtrees before there is any strong indication
that a given LTU is the best that can be found at
the node. This is because the algorithm calls for discarding
both subtrees of a node whenever the pocket
vector P of the current LTU is redefined. Such activity
will eventually cease because the pocket count of
the current LTU increases monotonically. However, to
reduce lost effort, no subtrees are allowed to grow below
an LTU until enough evidence has accumulated to
indicate that an improved pocket vector is not apt to
be found anytime soon. The status of an LTU is considered
to be determined if and only if the following
conditions are true:

1. At least kn, k ≥ 2, instances must have been presented
   to the LTU, where n is the number of original
   variables. This is a minimal test based on the
   capacity of a hyperplane (Duda & Hart, 1973); if
   fewer than 2n instances have been observed, then
   the LTU is known to be underdetermined.

2. As discussed above, the pocket vector for the LTU
   must separate at least one instance from each
   class.

3. Either the pocket vector P is identical to the active
   weight vector W, or each and every weight
   in W has been varying within its historical minimum
   and maximum for a number of weight adjustments
   greater than the log of the number of
   weights in W (Utgoff, 1988; Utgoff, in press).

Note that the status of the LTU can change back and
forth between determined and undetermined if a new
historical minimum or maximum is established infrequently.
Also note that one can raise k to cause the
algorithm to be more conservative about determining
the status of an LTU. In the current implementation
of PT2, the default value of k is 5.
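
The three conditions and the comparison procedure of Table 2 can be sketched together. Everything named here is an assumption made for illustration: the `LTUStatus` record and its fields are hypothetical bookkeeping, and the natural logarithm is used because the text does not state the base of the log.

```python
import math
from dataclasses import dataclass


@dataclass
class LTUStatus:
    n_original_vars: int           # original variables the LTU is based on
    n_weights: int                 # number of weights in W
    instances_seen: int            # instances presented to the LTU
    classes_separated: int         # classes with an instance separated by P
    pocket_equals_active: bool     # whether P is identical to W
    adjustments_within_range: int  # adjustments with all weights in range
    pocket_count: int              # longest run of correct classifications


def is_determined(s, k=5):
    """Conditions 1-3 from the text; k defaults to 5 as in PT2."""
    enough = s.instances_seen >= k * s.n_original_vars
    discriminates = s.classes_separated >= 2
    settled = (s.pocket_equals_active
               or s.adjustments_within_range > math.log(s.n_weights))
    return enough and discriminates and settled


def is_better(ltu1, ltu2, k=5):
    """Table 2: is ltu1 better than ltu2?"""
    if not (is_determined(ltu1, k) and is_determined(ltu2, k)):
        return False
    if ltu1.pocket_count > ltu2.pocket_count:
        return True
    return (ltu1.pocket_count == ltu2.pocket_count
            and ltu1.n_original_vars < ltu2.n_original_vars)


small = LTUStatus(3, 4, 20, 2, True, 0, 12)
large = LTUStatus(4, 5, 20, 2, False, 3, 12)
print(is_better(small, large), is_better(large, small))
```

With equal pocket counts, the tie is broken in favor of the LTU that tests fewer original variables.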

4  Illustrations 

This section illustrates the PT2 algorithm on three
standard learning tasks. The first is the DNF concept
used to illustrate the FRINGE algorithm (Pagallo,
1989). This concept is expressed succinctly by
allowing multivariate tests at a node. Second is the
multiplexor, both for the six-bit and three-bit cases.
Decision-tree algorithms normally do very poorly on
this problem because the best variable to test at the
root is not correlated with the classification. The final
illustration is Quinlan's (1987) hyperthyroid concept,
in which instances are described by a mix of unordered
and ordered (numeric) attributes, some with missing
values.

4.1  The  DNF  Task 

The DNF task was used previously to illustrate the
ability of FRINGE to find multivariate tests at a node,
though by a much different mechanism from that of
PT2. The concept to be learned is the Boolean function
ab ∨ cde. Although there are only 32 possible instances,
the algorithm always obtains its next training
instance by selecting randomly from the full space of 32
instances. Figure 2 shows the tree found by PT2 after
training on 775 instances; for clarity, the weights have
been scaled by a constant factor. The tree consists of
two tests and three leaves, and is logically equivalent
to that found by FRINGE. Recall that PT2 automatically
encodes TRUE as 0.5 and FALSE as -0.5. The
tree found by ID3 for this task consists of eight tests
and nine leaves.

Figure 2: Tree for the DNF Task

4.2 The Multiplexor Task

PT2 was run on both the six-bit and the three-bit
multiplexor tasks (Barto, 1985; Wilson, 1987; Quinlan,
1988). The three-bit multiplexor is of interest for
the sake of illustration rather than the complexity of
the task. The three-bit multiplexor is a Boolean function
over three Boolean variables. One variable, here
called f, is the address bit, and it serves as a
selector of one of the two variables g and h, called the
data bits. The value of the function is the value of the
selected data bit. The simplest tree for this concept
corresponds to the expression fg ∨ ¬fh. The problem
is interesting because the address bit is the best test
at the root, yet it is not correlated with the classification,
although the data bits are. Thus, for algorithms
that choose a split on such a basis, any data bit looks
like a better choice than any address bit. There exists
a split on all three variables that is correct for
7 of the 8 instances, and PT2 finds it, as the pocket
count is highest for this split. PT2 found a correct
tree after training on 480 instances, as shown in Figure
3. This PT2 tree contains two tests and three leaves,
whereas the tree found by ID3 contains five tests and
six leaves. Note that the test at the root
evaluates all three variables, which is also suboptimal.
One would prefer a tree in which only necessary tests
are performed.
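
The claim that a single linear split can be correct on 7 of the 8 three-bit multiplexor instances can be checked by brute force. The sketch below is an illustration only: the function names are hypothetical, the address convention (f true selects g) follows the expression in the text, and the integer weight range searched is an arbitrary choice, so the search demonstrates the 7-of-8 split without by itself proving that no split outside the range does better.

```python
from itertools import product


def mux3(f, g, h):
    """Three-bit multiplexor: address bit f selects data bit g or h."""
    return g if f else h


def encode(bit):
    """PT2-style encoding: TRUE -> 0.5, FALSE -> -0.5."""
    return 0.5 if bit else -0.5


instances = list(product([0, 1], repeat=3))


def n_correct(wf, wg, wh, bias):
    """Count instances that the LTU (score > 0 means positive) gets right."""
    correct = 0
    for f, g, h in instances:
        score = wf * encode(f) + wg * encode(g) + wh * encode(h) + bias
        correct += (score > 0) == bool(mux3(f, g, h))
    return correct


# Search all small integer weight vectors (range chosen for illustration).
best = max(n_correct(*w) for w in product(range(-4, 5), repeat=4))
print(best, n_correct(2, 4, 2, -1))
```

The concept is not linearly separable, since the positive instances (1,1,0) and (0,0,1) have the same midpoint as the negative instances (1,0,1) and (0,1,0), so no single LTU can reach 8 of 8.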

For the six-bit multiplexor task PT2 found a correct
tree of ten tests and eleven leaves after training
on 3,968 instances from the full space of 64 possible
instances. Total CPU time during training was 143
seconds. A characteristic of the PT2 algorithm is that
the first correct tree that it finds may not be the smallest
that it would find if training were to continue. It
is possible that the current LTU at a node will be replaced
by one that is better. Such improvements can
result either from an improved pocket vector in the
current LTU, or by replacing the current LTU with
one that is based on fewer of the original input variables.
For the six-bit multiplexor task, PT2 was left
to continue training until it had seen 71,800 training
instances. Although the final tree was the same size
as the first correct tree, in terms of tests and leaves,
some of the LTUs were replaced by LTUs based on
fewer variables. The total number of variables in all
ten LTUs of the first correct tree was 38, but this total
in the final tree was 25. The tree that ID3 produces
for this task contains 25 tests and 26 leaves.

Figure 3: Tree for the Three-Bit Multiplexor Task


4.3  The  Hyperthyroid  Task 

The hyperthyroid task (Quinlan, 1987) is of interest
because its instances are described in terms of both numeric
and nonnumeric variables, and because there are
training instances with missing values for both types
of variables. The training set contains 2,800 instances,
each falling into one of four classes. Because PT2 currently
handles only two-class problems, the task was
cast as learning the concept of "hyperthyroid".
Instances labelled as "hyperthyroid" were considered to
be positive, and all other instances were considered
to be negative. Each instance is described by 28 variables
(one variable whose value is missing for all the
instances is not included). Across all 28 variables for the
2,800 training instances, there are 1,756 missing values,
for an average of 0.63 missing values per instance. The
independent test set contains 972 instances described
in terms of the same variables, with a total of 536
missing values, for an average of 0.65 missing values
per instance.

Due to the lopsided distribution of the training instances,
with 97.79% being negative, a variation of
random sampling was used for selecting the next instance
during training. Instead of sampling randomly
from the entire population of training instances, the
instances from the two classes were kept as two separate
populations. An instance was selected randomly
within a population but the choice of population
alternated each time. This training strategy supplies
an even distribution of instances at the root, but the
effect diminishes at the subtrees because the tree is
attempting to separate the positives from the negatives.
PT2 does not depend on this training strategy. The
rationale is simply that with a lopsided distribution,
most instances will come from one class and therefore
be uninformative.
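
The alternating sampling scheme can be sketched as a generator. This is a minimal illustration with hypothetical names; which population is sampled first is not stated in the text and is an arbitrary choice here.

```python
import random
from itertools import cycle


def alternating_sampler(positives, negatives, rng=None):
    """Yield instances, alternating between the two class populations.

    Each draw is uniform within the chosen population. Starting with the
    positives is an assumption; the text does not specify the order.
    """
    rng = rng or random.Random()
    for population in cycle((positives, negatives)):
        yield rng.choice(population)


sampler = alternating_sampler(["p1", "p2"], ["n1", "n2", "n3"],
                              random.Random(0))
draws = [next(sampler) for _ in range(6)]
print(draws)
```

At the root this gives each class half of the training stream regardless of how lopsided the underlying population is.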

Figure 4 shows the tree found by PT2, having
trained on a total of 372,400 instances for approximately
one clock week. The tree classifies 99.46% of
the training data correctly and 99.07% of the test data
correctly. In the figure, the notation a.v indicates
a propositional variable corresponding to value v of
the original input variable a. Thus, ltu1 contains 25
weights based on 22 original variables, and ltu2 contains
27 weights based on 25 original variables.

Figure 4: Tree for the Hyperthyroid Task

It is unclear whether 99.07% correct classification
on the test data is good, especially given that simply
always guessing negative yields 98.25% correctly
classified. Precise classification accuracies of previous
solutions related to this task (Quinlan, 1987; Chan,
1989) were not published. It is worth noting that both
C4 and PT2 chose ftival as the most important variable
at the root. Also note that for PT2, one can rank
the relative importance of the variables by comparing
the magnitudes of their respective weights. It is meaningful
to compare weights because PT2 normalizes all
the variable values.
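
To put the two test-set accuracies side by side, the percentages can be converted back into approximate instance counts. The counts computed below are inferred here from the rounded percentages and the test-set size of 972; they are not reported in the paper, and the variable names are illustrative.

```python
test_size = 972
baseline_accuracy = 0.9825  # always guessing negative
tree_accuracy = 0.9907      # tree found by PT2

# Inferred counts (assumption: percentages were rounded to two decimals).
negatives = round(test_size * baseline_accuracy)
positives = test_size - negatives
baseline_errors = positives
tree_errors = test_size - round(test_size * tree_accuracy)

print(f"positives in test set: about {positives}")
print(f"errors: baseline about {baseline_errors}, tree about {tree_errors}")
```

Seen this way, the tree roughly halves the number of errors made by the always-negative baseline, although on a small number of positive instances.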

5 Discussion

This  section  identifies  strengths  and  weaknesses  of  the 
PT2  algorithm,  and  indicates  the  directions  in  which 
the  research  is  proceeding. 

5.1  Strengths 

There are four strengths of the algorithm. First, it
is incremental, which means that additional training
instances received at a later time will not obviate previous
learning by requiring that a new tree be built
from scratch. This is highly desirable for learning
algorithms that are embedded in systems that learn from
experience.

Second, the algorithm finds a multivariate test at
a node by training a linear threshold unit. This allows
defining regions in the instance space that have
boundaries in any orientation. This richer space of
splits, compared to allowing only univariate splits,
enables the program to find more compact trees. The
algorithm attempts to find a linear split that is optimal
in terms of classification, but that is based on a
small subset of the original variables.

Third, the algorithm treats all variable types uniformly
by encoding their ranges numerically. The encoded
values are normalized by mapping each range
onto the interval [-0.5, 0.5]. Encoding and mapping
are determined dynamically. This encoding allows one
to learn concepts over instances that are described by
a mix of ordered and unordered variables, rather than
one or the other.

Finally, the algorithm estimates missing values via
sample mean, which is an unbiased estimator of the
expected value. This is well defined for all variable
types because they are all encoded numerically.

5.2  Weaknesses 

There are four clear weaknesses in the algorithm,
which are being addressed as this work continues.
First, the mechanism for dropping variables from an
LTU is too costly. The algorithm may train O(n²)
LTUs at a node while searching for a good split. This is
worse than the predecessor perceptron tree algorithm
(Utgoff, 1988), which trained just one LTU at each
node.
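
The quadratic bound can be illustrated by counting LTUs in a worst case. The sketch assumes, for illustration only, that elimination proceeds one variable at a time all the way down to a single variable, and that a current LTU over m variables has m alternates; the function name is hypothetical.

```python
def worst_case_ltus(n):
    """Count LTUs trained at one node if variables are dropped down to 1."""
    total = 1  # the initial LTU over all n original variables
    m = n
    while m > 1:
        total += m  # m alternates, each omitting one of the m variables
        m -= 1      # one alternate becomes the new current LTU
    return total


for n in (3, 6, 28):
    print(n, worst_case_ltus(n))
```

Under these assumptions the count is n(n + 1)/2, which grows quadratically in the number of original variables.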

Second, there is no guarantee that PT2 will find
a tree that satisfies the sequential testing and
understandability goals outlined in Section 2.3. Indeed, the
tree found for the hyperthyroid task is poor in this
regard.

Third, the algorithm does not take advantage of
pruning techniques that have been designed to prevent
overfitting the training data (Breiman, Friedman,
Olshen & Stone, 1984; Mingers, 1989). Furthermore,
the algorithm is also subject to underfitting error. If
too few unique instances are delivered to a node, with
respect to the dimensionality of the instance descriptions,
then an LTU will be underdetermined (Duda &
Hart, 1973).

Finally,  the  algorithm  is  limited  to  discriminating 
just  two  classes. 


5.3  Near-Term  Extensions 

The  algorithm  is  being  extended  in  three  ways.  First, 
a  better  method  for  discarding  variables  at  a  node  is 
being  investigated.  Given  that  the  encoded  variables 
are  normalized,  one  can  identify  the  most  important 
variables  by  the  magnitudes  of  the  weights.  One  could 
base  the  LTU  for  the  split  on  just  those  variables  that 
have  sufficiently  large  magnitudes,  compared  to  the 
largest magnitude among the weights.

Second,  the  ability  to  discriminate  more  than  two 
classes  is  being  added.  Several  known  methods  are 
being  considered,  but  the  most  promising  is  to  sepa¬ 
rate  the  instances  into  two  superclasses  at  each  node 
(Breiman,  Friedman,  Olshen  &  Stone,  1984).  Such  a 
scheme  finds  “strategic”  splits,  in  the  sense  that  splits 
near  the  top  of  the  tree  group  together  those  classes 
that are similar, while splits near the leaves isolate
single  classes. 

Third,  a  pruning  mechanism  is  being  added  that  will 
attempt  to  prevent  overfitting  and  underfitting  prob¬ 
lems. 
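
The first extension, keeping only those variables whose weights are large relative to the largest weight, might be sketched as follows. The relative threshold `ratio` and the function name `select_variables` are hypothetical; the text does not say how "sufficiently large" is to be decided, and the sketch assumes the encoded variables have already been normalized.

```python
def select_variables(weights, ratio=0.1):
    """Return indices of weights whose magnitude is at least `ratio` times the largest one."""
    largest = max(abs(w) for w in weights)
    if largest == 0.0:
        return []
    return [i for i, w in enumerate(weights) if abs(w) >= ratio * largest]


# Weights of a trained LTU over four normalized variables (made-up numbers).
kept = select_variables([0.9, -0.05, 0.3, -1.2], ratio=0.2)
```

Only the variables indexed by `kept` would then be used for the LTU at the split.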

5.4  Longer-Term  Extension 

A  longer  term  problem  that  may  be  addressed  is  how 
to detect when the piece-wise linear classification strategy
is performing poorly and alter it automatically.
For  example,  if  the  members  of  the  target  concept  de¬ 
fine  a  hyperregion  with  one  or  more  curved  boundaries, 
then  a  potentially  infinite  number  of  hyperplanes  will 
be  needed  to  approximate  the  hyperregion.  This  would 
lead  to  a  very  large  tree,  yet  if  the  space  were  to  be 
split  with  hyperspheres  or  hyperellipses  then  the  tree 
would  be  smaller,  which  would  facilitate  learning. 
Abstract

This  paper  proposes  the  notion  of  ‘topological 
relevance’  as  a  means  to  formalize  the 
complexity  of  a  decision  tree  representation  of  a 
set  of  instances.  It  then  presents  an  incremental 
algorithm,  called  IDL,  for  the  induction  of 
decision  trees  which  are  optimal  according  to 
this notion. IDL relies on ideas from ID4 and
ID5 but searches using a statistical criterion for
expanding  nodes  and  a  tree  topological  one, 
called  topological  relevance  gain,  for 
transforming  decision  trees.  The  results  of  its 
analysis  and  empirical  validation  show  that, 
with  provisions,  IDL  rapidly  finds  the  same  or  a 
better  tree  than  top-down  induction  algorithms 
and their incremental versions ID4, ID5 and
ID5R.

1. Introduction

What  is  a  good  decision  tree?  Despite  the  fact  that 
induction  of  decision  trees  is  one  of  the  best  developed 
subfields of machine learning (see e.g. [2,3,5,9]), this
question  has  not  really  been  answered.  Of  course 
‘goodness’  has  multiple  facets,  some  of  them  depending 
on  the  application,  but  classification  accuracy  and  tree 
complexity  are  important  factors.  But  could  one 
recognize  a  good  tree  if  one  sees  one?  Our  quality 
measures  are  too  vague  to  do  so.  It  is  a  general 
shortcoming  of  top-down  induction  of  decision  tree 
algorithms (TDIDT, [5]) that there is no intentional
description of the tree they construct. Starting from a set
of examples, usually described as a set of attribute-value
pairs annotated with a class, TDIDT algorithms
successively split the set using a good (e.g. most
informative) attribute until sets of examples in a single
class are left. The splits become the nodes in the tree and
the leaves are labeled with the class of the examples. The
tree such an algorithm is looking for is simply the one it
happens to come up with. There is no reason why this
should be in some sense the best one. Accuracy of the
trees tends to vary little across attribute selection
measures [2]. However, size does vary and is often much
larger  than  optimal.  For  example  the  following  is  one  of 
the  two  smallest  decision  trees  for  the  6-multiplexer,  and 
far  beyond  the  capabilities  of  TDIDT  algorithms  ([6],  see 
also  figure  2): 


In  this  paper  I  precisely  define  a  notion  of  topological 
minimality  of  decision  trees.  The  above  tree  is  an 
example  of  a  topologically  minimal  tree.  Intuitively 
speaking,  a  decision  tree  which  correctly  classifies  a  set 
of  examples  is  topologically  minimal  if  an  attribute  is 
localized  as  much  as  possible  in  the  tree  avoiding,  for 
example,  obsolete  replications  of  subtrees.  TDIDT 
algorithms  may  find  these  topologically  minimal  trees 
but  in  other  cases,  like  for  multiplexer  concepts,  fail  to 
do  so.  So  I  go  on  to  present  an  algorithm,  called  IDL, 
which  was  designed  to  incrementally  construct  such 
optimal  trees.  IDL  searches  using  a  statistical  selection 
measure  like  any  of  those  TDIDT  would  use  [2]  for 
expanding  nodes,  but  a  new  and  topological  measure, 
called  topological  relevance  gain,  for  transforming  a  tree 
into  a  smaller  one.  Topological  relevance  is  measured  by 
using  the  decision  tree  in  a  bottom-up  (hypothesis- 
driven)  fashion.  It  expresses  the  import  of  an  attribute  for 
classifying  an  example  based  on  tree  topology  and  not,  as 
is  usual,  on  statistics  over  a  training  set.  Formal  analysis 
is  preliminary  but  experiments  show  that  IDL  is  good  at 
finding  a  topologically  minimal  tree  if  it  exists,  and  with 
less work than the other incremental algorithms ID4 [7],
ID5 [9] and ID5R [11] which often find suboptimal trees.
Current  IDL  deals  badly  with  noise  but  I  will  ignore  that 
problem  in  this  paper. 

The  paper  is  structured  as  follows.  Section  2  motivates 
and  defines  the  notions  of  topological  relevance  and 
topological  minimality.  Section  3  presents  the  IDL 
algorithm  and  a  detailed  example.  Section  4  presents  the 
results  of  complexity  analysis  and  some  experiments 
comparing IDL with ID5R. Section 5 describes related
and  future  work  and  some  open  problems. 
2. Topological Relevance

Given  a  tree  T  with  root  N,  what  can  be  said  about  the 
relevance  of  an  attribute  A  for  classifying  an  example  E? 
I  propose  to  consider  a  topological  notion  of  relevance 
(i.e.  based  on  the  structure  of  the  tree)  instead  of  a 
statistical  one  (i.e.  based  on  statistics  over  the  training 
set). I will use the example domain from [5] to illustrate
the  idea.  The  set  of  examples  is  shown  in  table  1. 

Table 1: The Weather Concept from [5]

                     Attributes
No.  Outlook   Temperature  Humidity  Windy  Class
1    sunny     hot          high      false  N
2    sunny     hot          high      true   N
3    overcast  hot          high      false  P
4    rain      mild         high      false  P
5    rain      cool         normal    false  P
6    rain      cool         normal    true   N
7    overcast  cool         normal    true   P
8    sunny     mild         high      false  N
9    sunny     cool         normal    false  P
10   rain      mild         normal    false  P
11   sunny     mild         normal    true   P
12   overcast  mild         high      true   P
13   overcast  hot          normal    false  P
14   rain      mild         high      true   N

Assume  one  has  the  following  tree  T: 


Consider  example  1  (call  it  E).  This  example  is  classified 
correctly.  The  branches  and  the  leaf  node  involved  in  the 
classification  process  starting  from  the  root  of  the  tree 
form  the  classification  path  of  E: 

(outlook=sunny  windy=false  humidity=high  N) 

This  classification  path  is  highlighted  with  arrows 
pointing  downwards  (i.e.  the  direction  of  the 


classification  process).  I  say  that  a  path  covers  an 
example  if  all  tests  on  it  are  satisfied  by  that  example, 
including  the  class  decision.  Here  a  branch  is  interpreted 
as  a  test  whether  the  test  attribute  at  the  parent  node  has 
as  value  the  label  of  the  branch.  The  classification  path 
obviously  covers  the  example  and  moreover  it  is  the  only 
complete  one  through  T,  i.e.  one  connecting  the  root 
with  a  leaf.  In  addition,  there  are  a  number  of  partial 
paths  that  also  cover  the  example.  Some  of  these  are 
found  by  using  the  tree  in  a  bottom-up  fashion:  Starting 
from  all  leaves  in  the  tree  which  are  labelled  with  N,  the 
example climbs the branches whose test it satisfies. In
this  way  the  classification  path  and  a  number  of  partial 
paths  covering  the  example  are  reconstructed.  These 
partial  paths  have  been  highlighted  in  the  previous  figure 
with  arrows  pointing  upwards  (i.e.  the  direction  in  which 
they  are  constructed).  There  are  two  such  partial  paths: 

(N)  and  (humidity=high  N) 

The  occurrence  of  the  attribute  humidity  on  the 
classification  path  as  well  as  on  one  of  the  partial  paths 
can  be  interpreted  as  follows: 

(1)  it  confirms  the  relevance  of  humidity  for  the  class- 
membership  of  E  because  it  plays  a  role  in  the 
hypothesis-driven  confirmation  of  E’s 
classification  result  (N).  Judged  by  tree  structure, 
humidity=high  is  highly  predictive  for  the  class  N. 

(2)  the  value  of  the  closest  common  attribute  above 
the  two  humidity-nodes  (windy)  does  not  influence 
the  role  of  humidity  in  E's  classification. 
Branching on windy does not change the way in
which  humidity  is  used. 

For a given tree T, attribute A and example E in class C,
let us call the number of occurrences of A on the
(complete or partial) paths reconstructed by hypothesis-
driven tree climbing using E and starting from T's leaves
labeled C, the topological relevance TR_T(A,E) of A for
E. Thus, these are the scores for topological relevance:

TR_T(outlook,E) = 1
TR_T(temperature,E) = 0
TR_T(humidity,E) = 2
TR_T(windy,E) = 1
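
The bottom-up climbing procedure can be made concrete with a short sketch. Since the figure of tree T did not survive, the tree below is an assumed reconstruction: the subtree under outlook=sunny follows from the classification path and partial paths given in the text, whereas the overcast and rain branches are guesses chosen to be consistent with those paths. The nested-tuple representation and the function names are illustrative only.

```python
# Assumed reconstruction of tree T. A leaf is a class label; a decision node
# is a pair (attribute, {branch value: subtree}).
TREE = ("outlook", {
    "sunny": ("windy", {
        "false": ("humidity", {"high": "N", "normal": "P"}),
        "true": ("humidity", {"high": "N", "normal": "P"}),
    }),
    "overcast": "P",
    "rain": ("windy", {"true": "N", "false": "P"}),
})


def leaf_paths(tree, path=()):
    """Yield (path, class) per leaf; path holds the (attribute, value) tests from the root."""
    if isinstance(tree, str):
        yield path, tree
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from leaf_paths(subtree, path + ((attribute, value),))


def topological_relevance(tree, example, cls, attributes):
    """Count attribute occurrences on the paths rebuilt by climbing from leaves labeled cls."""
    scores = {a: 0 for a in attributes}
    for path, leaf_class in leaf_paths(tree):
        if leaf_class != cls:
            continue
        # Climb from the leaf for as long as the example satisfies each branch test.
        for attribute, value in reversed(path):
            if example[attribute] != value:
                break
            scores[attribute] += 1
    return scores


# Example 1 from table 1, which belongs to class N.
E = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false"}
scores = topological_relevance(
    TREE, E, "N", ["outlook", "temperature", "humidity", "windy"])
```

Under these assumptions the sketch yields the four scores listed above for example 1.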

The  first  observation  above  suggests  that  topological 
relevance  measures  the  import  of  an  attribute  for 
determining  the  class  of  an  example  (i.e.  the  predictive 
power  of  the  attribute-value  for  the  class).  The  second 
observation  supports  the  hypothesis  that  testing 
humidity  first  may  make  the  windy  test  obsolete. 
Therefore, if both attributes can be switched chances are
that the tree (actually its subtrees) can be pruned.

A tree transformation technique which is also used in
ID5 and ID5R [9,11] can be used to swap the attributes
windy  and  humidity.  This  technique,  which  is  justified 
by  commutativity  of  attribute  tests,  interchanges  the 
windy  and  humidity  attributes  and  regroups  the  subtrees 
at the second level down (in this case the four leaves).
One  obtains  the  following  tree: 


The  windy  tests  have  now  become  obsolete  and  pruning 
at these nodes results in the following decision tree:


[Figure: the pruned decision tree. Surviving branch labels: high, normal, true, false; surviving leaf label: N.]


The  classification  path  of  E  is  highlighted  in  the  pruned 
tree.  The  scores  for  the  topological  relevances  of  the 
attributes  for  the  example  in  this  tree  are  now  the 
following: 

TR_T(outlook,E) = 1
TR_T(temperature,E) = 0
TR_T(humidity,E) = 1
TR_T(windy,E) = 0


The  topological  relevance  of  the  attribute  humidity  is 
now  1,  which  is  as  low  as  it  can  get  if  humidity  is 
relevant  at  all.  So  in  this  case  the  transformation  did 
minimize  topological  relevance,  intuitively  meaning  that 
an attribute's activity is localized as much as possible in
the  tree  (compare  with  the  topological  redundancy  in  the 
original  tree).  This  was  achieved  by  pulling  up 
topologically  more  relevant  attributes  in  the  tree.  This  is 
the  intuitive  basis  of  the  IDL  algorithm  which  is 
described  in  section  3. 
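
The swap-and-prune step of this section can be sketched on the subtree below outlook=sunny. This is a simplified illustration, not the full ID5 pull-up: it assumes every child of the node being revised is a decision node testing the same attribute, and the function names and tuple representation are hypothetical.

```python
def swap_levels(node):
    """Exchange the test at `node` with the test shared by all of its children."""
    top_attr, top_branches = node
    low_attr = next(iter(top_branches.values()))[0]
    regrouped = {}
    for top_value, (attr, low_branches) in top_branches.items():
        assert attr == low_attr, "all children must test the same attribute"
        # Regroup the second-level subtrees under the value of the lower attribute.
        for low_value, subtree in low_branches.items():
            regrouped.setdefault(low_value, {})[top_value] = subtree
    return (low_attr, {v: (top_attr, b) for v, b in regrouped.items()})


def prune(tree):
    """Collapse every subtree whose leaves all carry one class into a single leaf."""
    if isinstance(tree, str):
        return tree
    attribute, branches = tree
    pruned = {v: prune(s) for v, s in branches.items()}
    children = list(pruned.values())
    if all(isinstance(c, str) for c in children) and len(set(children)) == 1:
        return children[0]
    return (attribute, pruned)


# Subtree below outlook=sunny, as implied by the classification path of example 1.
windy_subtree = ("windy", {
    "false": ("humidity", {"high": "N", "normal": "P"}),
    "true": ("humidity", {"high": "N", "normal": "P"}),
})
swapped = swap_levels(windy_subtree)
result = prune(swapped)
```

After the swap both windy nodes separate nothing, so pruning leaves a single humidity test.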

Note  that  topological  relevance  of  an  attribute  for  an 


example is not the same as the number of occurrences of
an attribute in the tree. For example, in the original tree
the topological relevance TR_T(windy,E) is 1 despite the
apparent symmetry of the tree with respect to that
attribute. Also note that the topological relevance of an
attribute is at least one for an attribute which is used on
the  classification  path. 

It  is  easily  verified  that  for  the  final  tree  the 
topological  relevance  of  all  attributes  for  all  examples  in 
table  1  is  equal  to  0  or  1.  This  means  that  there  is  no 
topological  redundancy  in  the  tree:  there  is  no  replication 
of  decision  nodes  with  a  similar  role  for  classification. 
Similar  partial  classification  paths  are  represented  by  the 
same  branches. 

I say that a tree T is topologically minimal with
respect to a set of examples {E_i}_i, if it is correct, cannot
be pruned without loss of correctness and TR_T(A,E_i) ≤ 1
for each attribute A and every i. (I also assume that a
decision tree is well-formed, meaning that it has no
decision nodes with only one branch (infra).) So the final
decision tree shown above is topologically minimal with
respect to the training set from table 1. The optimal tree
for the 6-multiplexer which was shown before is also
topologically minimal in the above sense with respect to
the 64 instances.
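
For readers without the figure, the sketch below enumerates the 64 instances of the 6-multiplexer and checks a tree that tests the two address bits first and then the addressed data bit. The tree shape is an assumed form of the optimal tree (the original figure is missing), and the bit naming and the convention that `a0` is the high address bit are choices made for this example.

```python
from itertools import product


def multiplexer6(a0, a1, d0, d1, d2, d3):
    """Output the data bit selected by the two address bits."""
    return (d0, d1, d2, d3)[2 * a0 + a1]


def address_first_tree():
    """Tree testing a0, then a1, then the addressed data bit (assumed shape)."""
    def data_node(k):
        return (f"d{k}", {0: 0, 1: 1})
    return ("a0", {
        0: ("a1", {0: data_node(0), 1: data_node(1)}),
        1: ("a1", {0: data_node(2), 1: data_node(3)}),
    })


def classify(tree, example):
    """Follow branch tests until a leaf (an int class label) is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree


def count_nodes(tree):
    """Return (total number of nodes including leaves, number of leaves)."""
    if not isinstance(tree, tuple):
        return 1, 1
    totals = [count_nodes(s) for s in tree[1].values()]
    return 1 + sum(t[0] for t in totals), sum(t[1] for t in totals)


NAMES = ("a0", "a1", "d0", "d1", "d2", "d3")
instances = [dict(zip(NAMES, bits)) for bits in product((0, 1), repeat=6)]
tree = address_first_tree()
all_correct = all(classify(tree, x) == multiplexer6(**x) for x in instances)
nodes, leaves = count_nodes(tree)
```

Each attribute occurs at most once on any root-to-leaf path of this tree, and each data bit is tested at a single node.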

3.  IDL 

IDL  is  a  new  incremental  algorithm  for  the  induction  of 
decision  trees  which  are  topologically  minimal  in  the 
sense  explained  in  section  2.  Earlier  incremental  decision 
tree algorithms, in particular ID4 [7], ID5 [9] and ID5R
[11] are essentially incremental versions of TDIDT [5]:
they use backtracking (in ID4) or simulated backtracking
(in ID5 and ID5R) to recover as best as possible and with
minimal  loss  of  training  effort  from  deviations  from 
TDIDT’s  search  path  (general  to  specific  hill-climbing 
without  backtracking)  which  is  considered  to  be  “ideal” 
and  leading  to  the  best  tree.  IDL  builds  on  this  previous 
work and uses two ideas which stem from ID4 and ID5
respectively: 

ID4: one can maintain attribute-value and class counts
for every potential test attribute at every node and
use these to find the best test attribute for splitting
according to a statistical selection measure [2]
without re-examining the examples;

ID5: a test attribute can be replaced by another one by a
tree transformation technique which pulls up an
attribute from below without re-examining the
part of the examples which is implicit in the
decision tree. Moreover attribute-value and class
counts are easily recalculated.
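
The ID4 idea of keeping counts at a node can be illustrated as follows. The sketch stores, for every attribute, value and class, how many examples have reached the node, and chooses a test attribute from the counts alone. Expected class entropy (equivalent to maximizing information gain) is used here as one possible statistical selection measure; the text leaves the criterion open, and the class and method names are hypothetical.

```python
from collections import defaultdict
from math import log2


class NodeCounts:
    """Counts kept at one node: counts[attribute][value][class] and class totals."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        self.class_counts = defaultdict(int)

    def add(self, example, cls):
        """Update the counts with one example; the example itself is not stored."""
        self.class_counts[cls] += 1
        for attribute, value in example.items():
            self.counts[attribute][value][cls] += 1

    def best_attribute(self):
        """Pick the attribute with the lowest expected class entropy after splitting."""
        total = sum(self.class_counts.values())

        def expected_entropy(attribute):
            result = 0.0
            for class_hist in self.counts[attribute].values():
                n = sum(class_hist.values())
                h = -sum(c / n * log2(c / n) for c in class_hist.values() if c)
                result += n / total * h
            return result

        return min(self.counts, key=expected_entropy)


# The fourteen examples of table 1, one per line.
ROWS = """sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N"""

ATTRIBUTES = ("outlook", "temperature", "humidity", "windy")
node = NodeCounts()
for line in ROWS.splitlines():
    *values, cls = line.split()
    node.add(dict(zip(ATTRIBUTES, values)), cls)
best = node.best_attribute()
```

With all fourteen examples counted at the root, this measure selects outlook.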

IDL  performs  a  heuristic  search  using  three  types  of 
search  steps,  namely  specialization  by  splitting, 
generalization by pruning, and transformation by ID5's
pull-up technique [9,11]. IDL's increased power stems
from two insights:
Table 2: The IDL Tree Update Algorithm


Let  T  be  the  current  tree  and  E  the  next  example.  To  update  the  tree  T  using  the  example  E  do  the  following: 

1.  Classify 

Classify E, updating the count information at each node on the classification path. Let N be the node where E ends
up (i.e. a leaf or a split node with a missing branch).

2. Expand

2.1. If E is incorrectly classified then expand N into a tree using TDIDT and the statistical criterion.

2.2. If E cannot be classified then add a new branch to N and label its leaf with the class of E.

2.3. Classify E in the subtree rooted at N (updating count information) and remember the part of E which is not
implicit in the classification path at the leaf L where E ends up.

3. Restructure

Recursively update the test attribute of the nodes on the entire classification path of E from leaf L to root T.


1.  search  using  these  transformation  steps  can 
potentially  come  up  with  better  trees  than  TDIDT 
because  it  lets  hypotheses  be  reached  in  parts  of 
the search space which are remote from TDIDT's
search  path; 

2.  a  measure  based  on  topological  relevance  (see 
section  2)  can  be  used  for  selecting 
transformations.  It  exploits  the  knowledge  from 
the  structure  of  the  tree  before  transformation  to 
create  opportunities  for  pruning  after 
transformation. 

Thus,  instead  of  using  transformations  to  stay  close  to 
TDIDT’s  search  path,  IDL  uses  transformations  to  come 
up  with  better  trees.  It  therefore  uses  two  different 
selection  measures:  a  statistical  one  like  any  of  those 
TDIDT  would  use  for  splitting  (specialization)  and  the 
topological one for selecting transformations. IDL prunes
(generalization)  wherever  it  finds  obsolete  splits. 


3.1.  IDL:  Algorithm 

IDL  starts  with  an  empty  tree  and  processes  examples 
one  by  one.  With  the  first  example  IDL  creates  a  root- 
node  and  labels  it  with  the  class  of  the  example.  Assume 
a  fixed  statistical  criterion  for  splitting  nodes  (e.g.  any  of 
those  from  [2]).  The  basic  algorithm  for  processing  an 
example using the current tree is shown in table 2. This
skeleton is very similar to ID5's, except that ID5 updates
test attributes from root to leaf.

IDL  also  uses  the  transformation  technique  from 
ID5(R)  [9,11]  to  pull  up  an  attribute  to  a  node.  This 
technique  is  based  on  an  operation  which  switches  the 
levels  of  attributes  in  the  tree  while  preserving  (and 
possibly  increasing)  classification  accuracy.  In  IDL  the 
pull-up  technique  requires  one  extra  operation.  Attribute 
switching  regroups  subtrees  and  there  is  no  reason  why 
this  should  result  in  groups  with  more  than  one  subtree 
(see  [9],  [11]  for  details).  In  ID5(R)  the  resulting  single 


Table 3: The IDL Attribute Revision Algorithm


Let M be a node on the classification path of E. Let A_1...A_n be the attributes used on this path down from M and
excluding the test attribute at M.

3.1. For each leaf under M which is labelled with the same class as E, reconstruct the largest path starting from this leaf
and covering the example E, using only attributes among A_1...A_n.

3.2. For each attribute A_1...A_n compute its topological relevance TR_M(A_i,E), which is simply the number of
occurrences on all the reconstructed paths. Note that TR_M(A_i,E) > 0.

3.3. For each attribute A_1...A_n also compute the topological relevance TR_S(A_i,E) where S is the immediate son of M
on the classification path of E. Note that TR_M(A_i,E) ≥ TR_S(A_i,E).

3.4. For each attribute A_1...A_n, compute the topological relevance gain as the weighted increase in topological
relevance between S and M, i.e.:

TRG_M(A_i,E) = (TR_M(A_i,E) - TR_S(A_i,E)) / TR_M(A_i,E)

3.5. If some attribute among A_1...A_n has a positive topological relevance gain then pull up the one with the highest
score to M. Ties are resolved by choosing the attribute nearest to M.

3.6. Otherwise compute the best test attribute A_best at M according to the statistical criterion and assure its presence in
the tree with root M. To assure the presence of A_best in the tree with root M, do nothing if A_best is already used
in this tree. Otherwise, pull up A_best to M. (The pull-up process will make A_best appear).

3.7. Prune any subtree of M which is obsolete, i.e. all leaves under it are labeled with the same class.
branch splits, being uninformative, will automatically
disappear  through  subsequent  training  (in  ID5)  or 
recursive  restructuring  of  the  subtrees  (in  ID5R).  In  IDL 
this  is  not  the  case  and  they  need  to  be  explicitly 
removed.  This  is  achieved  by  applying  the  same  pull-up 
technique  to  push  the  uninformative  attribute  down  and 
eventually out of the tree (see also [12]).

The  crucial  distinction  between  IDL  and  the  other 
algorithms  is  in  the  heuristic  for  applying  tree 
transformations.  The  algorithm  to  select  a  better  test 
attribute  is  shown  in  table  3.  IDL  uses  the  example  to 
guide  its  search  for  useful  transformations  in  the  most 
relevant  parts  of  the  tree.  The  role  of  step  3.6  is  to  assure 
enough  diversity  among  the  attributes.  When  a  tree  has 
become  fully  accurate  there  will  be  no  more  expansions 
of  nodes  (step  2)  and  the  transformations  (step  3.5)  alone 
will  never  lead  to  the  appearance  of  a  new  and  possibly 
crucial attribute. As in ID5(R) it is useful to keep pruned
subtrees  around  (virtual  pruning)  because  they  embody 
training  effort  which  may  be  relevant  for  subsequent  pull- 
up  operations. 
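
Steps 3.4 and 3.5 of table 3 amount to a small computation once the topological relevance scores at M and at its son S are known. The sketch below assumes those scores are supplied as dictionaries and that the attributes are listed from nearest to M to farthest; the function names and the numeric scores in the usage lines are illustrative, not taken from the paper.

```python
def relevance_gain(tr_m, tr_s):
    """Step 3.4: weighted increase in topological relevance between S and M."""
    return (tr_m - tr_s) / tr_m


def choose_pull_up(path_attributes, tr_m, tr_s):
    """Step 3.5: return the attribute to pull up to M, or None if no gain is positive.

    `path_attributes` lists the attributes ordered from nearest to M to
    farthest, so that ties go to the attribute nearest to M.
    """
    best, best_gain = None, 0.0
    for attribute in path_attributes:
        gain = relevance_gain(tr_m[attribute], tr_s[attribute])
        if gain > best_gain:  # strict comparison keeps the nearer attribute on ties
            best, best_gain = attribute, gain
    return best


# Two attributes with equal gain 0.5: the one nearest to M wins the tie.
winner = choose_pull_up(
    ["outlook", "humidity"],
    {"outlook": 2, "humidity": 2},
    {"outlook": 1, "humidity": 1},
)
# No attribute gains relevance between S and M: fall through to step 3.6.
no_gain = choose_pull_up(["humidity"], {"humidity": 1}, {"humidity": 1})
```

When `choose_pull_up` returns `None`, the statistical criterion of step 3.6 would be consulted instead.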

Note  that  IDL  may  find  the  same  tree  as  the  other 
algorithms.  It  does  so  however  mostly  based  on 
topological  considerations  instead  of  on  statistical  ones.  I 
refer  to  figure  2  for  a  case  in  which  this  makes  a 
substantial  difference. 

3.2.  IDL:  Example 


The activity of IDL in the early stages of training
consists mostly of node expansions (step 2) so that the
tree becomes larger and more accurate. The larger the tree
the higher the chance for topological redundancy.
Suppose one has reached this tree which correctly
classifies all but one example (example 1 cannot be
classified):


Suppose example 8 is next. It is classified correctly (step
1) and IDL continues by revising the test attributes on
the classification path, from leaf to root (step 3). At the
temperature node no topological relevance gains are
computed (step 3.5 fails). Suppose that the statistically
most relevant attribute at that node is humidity. It is not
present in the subtree (nothing is) so it is pulled up (step
3.6) by expanding two leaves using humidity, and
swapping humidity with temperature. Subsequent
pruning removes the obsolete temperature splits (step
3.7). After revising one node the tree is as follows:


The classification path of example 8 in the new tree is
highlighted. IDL now revises attributes higher up on this
path. At the outlook node this generates no action. At
the root node the paths are reconstructed using the
attributes outlook and humidity.^ IDL finds that
humidity and outlook both have a topological relevance
gain of .5. Outlook is the highest up in the tree and is
pulled up to the root (step 3.5). The subtree on the
branch outlook=overcast is pruned (step 3.7):


^ It is important to compute the set of attributes A_i at
each level of the tree after revision of the lower parts of
the path. For example, before the previous
transformation humidity was not one of them, but now it
is.

The next example is example 1. Note that it is now
classified  correctly:  transformation  did  increase  accuracy. 
The  classification  path  goes  in  the  highly  symmetrical 
subtree  with  windy  at  the  root.  IDL  detects  the 
topological  relevance  of  humidity  at  that  node  for 
example  1  and  pulls  it  up  (step  3.5).  After  this 
transformation,  which  was  illustrated  in  section  2, 
pruning  is  possible  (step  3.7).  IDL  still  has  to  revise  the 
test at the root but finds no better attribute. The tree is
topologically  minimal  so  that  none  of  the  examples  leads 
to  further  revisions.  It  should  be  stressed  that,  though  in 
this  case  IDL  finds  the  same  tree  (the  best  one)  as  the 
other algorithms, it does so in a very different manner.

4.  Analysis  and  Empirical  results 

4.1.  Complexity  and  Convergence 

In  [12]  I  analyze  worst-case  order  of  magnitudes  for 
training  cost.  As  in  [11]  this  is  decomposed  as  the 
number  of  instance-count  additions  (ica)  and  the  number 
of evaluations of the statistical criterion (E-score), and
expressed in the number of attributes |A| and the maximal
branching factor b. Table 4 summarizes the results of the
analysis of training cost per example for ID5, ID5R and
IDL. 


Table 4: Worst Case Training Cost per Example

        Icas             E-scores
ID5     O(b^|A|)         O(|A|^2)
ID5R    O(|A|·b^|A|)     O(b^|A|)
IDL     O(|A|·b^|A|)     O(|A|^2)

As with ID5 and ID5R one pass over the examples is
sufficient to build a correct tree with IDL. However what


is  important  for  IDL  is  the  number  of  examples  required 
to  converge  to  the  optimal  tree  when  examples  are 
randomly  seen.  Optimal  means  correct,  topologically 
minimal,  cannot  be  pruned  without  loss  of  accuracy  and 
does not contain single-branch splits (see section 3).
Analysis  is  lacking  here  but  experiments  have  shown 
that  for  some  concepts  and  certain  training  histories, 
IDL  may  fail  to  find  the  optimal  tree  even  if  it  exists. 
Elomaa  [1]  shows  and  explains  cases  of  non-convergence 
on the 3-multiplexer. No case has been observed where
IDL  never  finds  the  optimal  tree  if  it  exists,  so  multiple 
runs  of  the  algorithm  may  be  a  solution.  It  is  easily  seen 
from  the  algorithm  in  table  3  that,  if  IDL  ever  generates 
a  correct  tree  which  is  topologically  minimal  then  it  will 
stop  performing  transformations.  This  and  the 
experimental  results  support  the  conjecture  (as  yet  not 
proven)  that,  if  a  topologically  optimal  tree  exists  then 
with  non-zero  probability  IDL  finds  it  and  in  contrast  to 
ID5(R) sticks to it. If there exist multiple such trees then
IDL  finds  one  of  them,  though  not  necessarily  the 
smallest  in  terms  of  number  of  nodes  (use  multiple 
runs).  If  no  topologically  minimal  tree  exists  then  IDL 
will  not  converge  to  a  unique  tree.  Typically,  with  each 
example  the  tree  is  reconfigured  to  lower  the  topological 
relevance  of  an  attribute,  thereby  increasing  it  for  another 
one.  Some  and  possibly  all  trees  in  this  limit  cycle  have 
lowest  possible  topological  relevance  for  the  set  of 
examples,  though  never  all  one  or  zero  (which  would 
mean  convergence  to  a  unique  tree). 

4.2.  Experimental  Results 

I did experiments to compare IDL and ID5R^. The results
are  stated  in  terms  of  the  evolution  of  tree  complexity 
(number  of  nodes  and  number  of  leaves)  and  accuracy 
with  increasing  number  of  training  examples.  Training  is 
done  in  groups  of  examples  randomly  chosen  with 
replacement. Measurements are averaged over 20 runs.
The  characteristic  behavior  of  IDL  is  to  grow  a  tree 
which  is  far  too  large  but  then  rapidly  collapses. 

Figure  1  shows  the  results  for  the  weather  concept 
from  table  1.  There  is  a  unique  topologically  minimal 
tree  (8  nodes,  5  leaves).  IDL  has  20  correct  and 
topologically  minimal  trees  after  70  examples.  ID5R  is 
equally fast in accuracy, but does not always find the
minimal  tree  (average  of  8.7  nodes  and  5.4  leaves).  In 
numerous  runs  on  this  concept  IDL  never  failed  to  find 
the  topologically  minimal  tree. 

Figure  2  shows  the  results  for  the  6-multiplexer.  There 
are  two  topologically  minimal  trees  which  are  essentially 
equivalent  (15  nodes  and  8  leaves).  For  IDL  all  20  trees 


^ID5R is used with postpruning to remove
unnecessary splits after transformation (see [9] p. 111).
The measurements for ID5 are hardly different from those
for ID5R.
Figure 1: IDL (white marks) and ID5R (black marks) averaged over 20 runs on the weather concept (see table 1). Training
per 5. Left axis: tree size plotted in squares (number of leaves) and triangles (number of nodes). Right axis: accuracy with
respect to all examples, plotted with circles.


are fully accurate after 110 examples, and topologically
minimal after 150 examples. After 200 examples ID5R
has an average of 41.6 nodes, 21.3 leaves and 99.7%
accuracy. In numerous runs on this concept IDL never
failed to find a topologically minimal tree. IDL has also
been run on the 11-multiplexer which uses 3 address-bits
and  8  data-bits.  There  are  2048  examples.  The  optimal 
tree  has  16  leaves  and  31  nodes.  IDL  had  a  correct  tree 


after (less than) 650 examples and the minimal one after
(less than) 700 examples. Surprisingly IDL sometimes
fails to find the minimal tree for the 3-multiplexer (about
40%  hit-rate).  Often  it  limit-cycles  between  the  two 
smallest  TDIDT  equivalent  trees  (see  also  [1]). 

The  concept  of  3-parity  has  6  smallest  trees  with  the 
same  topology,  but  no  topologically  minimal  one  (TR- 
scores  1,1,2  for  all  examples).  The  graphs  are  similar  to 
Figure 2: IDL (white marks) and ID5R (black marks) averaged over 20 runs on the 6-multiplexer. Training per 10. Left
axis: tree size plotted in squares (number of leaves) and triangles (number of nodes). Right axis: accuracy with respect to
all examples, plotted with circles.
those in figure 1. However, IDL goes into a limit cycle
of three smallest trees (15 nodes, 8 leaves) with every
example  permuting  attributes  through  the  tree  levels. 
ID5R  goes  to  a  slower  cycle  of  all  6  trees. 

I  used  IDL  as  a  postprocessor  for  optimizing  TDIDT 
generated trees while in use for classification. The results
for the 6-multiplexer are shown in figure 3. They show
how IDL rapidly collapses the trees to a topologically
minimal form. Much the same effect can be achieved
without using any statistical information or examples for
focus, but purely by analyzing the occurrences of the
attributes on a path from root to leaf. Experiments like
this with a variant of IDL are reported in [1].

Table 5 shows the run-time improvement of IDL over
ID5R expressed in icas, E-scores, expansions, prunings
and attribute switches. The figures express reduction in
total training cost, averaged over 20 consecutive examples,
and averaged over 20 runs on the weather concept
(examples 55-75) and 6-multiplexer (examples 130-150).
Note that IDL always found the same or a better tree.


Table 5: Computation Reduction with Respect to ID5R

                   weather   6-multiplexer
ICAs                 11%          29%
E-scores              6%          32%
Expansions           40%          66%
Prunings             44%          66%
Transformations      37%          49%

4.3.  Discussion 

Despite the fact that IDL sometimes fails to converge to
a topologically minimal tree, it performs well on a
number of standard concepts. There may be two reasons
for this. Firstly, the topological flavor of IDL makes it
less sensitive to statistical variations (non-representative
example distributions). This is well illustrated by the
weather concept where ID5R sometimes converges to a
tree with humidity as root. For the same training history,
IDL is not distracted by these variations. Secondly, IDL
uses individual examples to guide its search but actually
reasons about classification paths which represent classes
of examples. This may explain the fast convergence on
the 6-multiplexer for which these classes are fairly large.
Note that IDL’s reliance on individual examples likely
makes it very sensitive to noise.

5. Related and Future Work

5.1.  Related  Work 

IDL  is  a  direct  descendant  of  earlier  incremental  decision 
tree  algorithms  ID4  [7],  ID5  [9]  and  ID5R  [11].  In 
section  3  it  was  noted  that  these  do  not  tackle  the 
problem  of  suboptimal  trees,  because  they  try  to  stick  to 
the  search  path  which  TDIDT  would  follow  and  thus 
inherit  from  TDIDT  the  problem  of  suboptimality. 
Selection  measures  for  TDIDT  algorithms  have  been 
improved  to  generate  smaller  trees  without  loss  of 
accuracy  [2,5].  However,  Quinlan  [6]  shows  why  TDIDT 
algorithms  have  problems  with  concepts  like  the 
multiplexer.  Seshu  [8]  shows  that  TDIDT  algorithms  are 
fundamentally  incapable  of  effectively  learning  a  class  of 
generalized  parity  concepts.  Remedies  fall  in  two 
categories:  subsequent  simplification  of  trees,  or  change 
of  language  bias  by  introducing  new  attributes. 

Pruning techniques [3] allow one to change a tree near
the fringe. For concepts like the multiplexer this is not
sufficient. Quinlan [6] proposes to transform a tree into a
set of rules which are subsequently simplified by
selectively deleting conditions. For comparison, note that
IDL works with one representation and that the pull-up
process modifies several rules at once and is capable of
both introducing and deleting attributes on a path.
Pruning and Quinlan's technique solve problems of
noise, while IDL's reliance on individual instances makes
it sensitive to noise.

Figure 3: IDL as a postprocessor for TDIDT-generated trees, applied to 20 identical trees for the 6-multiplexer. Left axis:
tree size plotted in squares (number of leaves) and triangles (number of nodes). Right axis: the number of fully collapsed
trees at that moment, plotted with black circles.

The  FRINGE  algorithm  [4]  is  designed  to  tackle  the 
problem  of  replication  of  subtrees  when  learning  decision 
trees  for  boolean  Disjunctive  Normal  Form  concepts  by 
introducing  macro-attributes.  Like  IDL  this  algorithm 
has  a  topological  flavor  and  uses  decision  trees  in  a 
bottom  up  way.  It  is  however  not  incremental  and  takes 
TDIDT  as  its  gold-standard.  In  comparison  note  that  IDL 
is  incremental,  does  not  change  language  bias  and  tackles 
the  replication  problem  for  concepts  which  do  have  a 
representation  without  replication.  Seshu  [8]  also 
introduces  macro-attributes,  though  not  based  on  an 
analysis  of  tree  topology. 

Utgoff [10] exploits the incremental nature of ID5R to
increase the quality of learning. The idea is to learn only
from those examples which are wrongly classified. Tree
size improves, though for the 6-multiplexer it is still far
from optimal (31 nodes). This idea does not seem
applicable to IDL, which does most simplification work
after the tree has reached full accuracy.

Elomaa [1] presents experimental results on using a
variant of IDL (without step 3.6) as a postprocessor to
optimize TDIDT generated decision trees. The results
show that for exclusive-or functions IDL efficiently
removes tests for irrelevant attributes.

5.2.  Future  Work 

The  role  of  the  statistical  measure  in  step  3.6  of  the 
algorithm  needs  to  be  better  understood,  in  particular 
with  respect  to  the  goodness  which  is  required  of  it.  I 
used  IDL  with  random  attribute  selection,  fixed  order 
selection  and  second  best  according  to  information  gain. 
In  all  these  cases  IDL  did  not  converge.  I  did  not  check 
whether  IDL  improves  with  some  of  the  selection 
measures  from  [2].  Further  analysis  and  experimentation 
are  needed  to  derive  average  case  estimates  for 
computation  time  and  complexity  and  to  compare  these 
with  the  worst  case  estimates  (finding  optimal  trees  is  in 
general  NP-complete...).  Run-time  measurements  must 
be  analysed  in  greater  detail.  The  IDL  algorithm  needs  to 
be extended to detect and handle non-convergence. Finally,
issues of noise have not been investigated yet.

6. Conclusion

This  paper  has  introduced  the  notion  of  topological 
relevance  as  a  measure  for  the  complexity  of  a  decision 
tree  representation.  It  has  presented  an  algorithm  called 
IDL  for  incremental  induction  of  topologically  minimal 
trees.  The  key  idea  is  to  use  tree  transformations  to  create 
opportunities  for  pruning,  guided  by  an  analysis  of  tree 
topology. Empirical results show that IDL often rapidly
finds a minimal tree if one exists. This work is continued
with experiments, analysis and improvements to cope
with non-convergence and noise.

Abstract

A rational analysis tries to predict the behavior
of a cognitive system from the assumption that it is
optimized to the environment. An iterative
categorization algorithm has been developed
which attempts to get optimal Bayesian estimates
of the probabilities that objects will display
various features. A prior probability is estimated
that an object comes from a category and
combined with conditional probabilities of
displaying features if the object comes from the
category. Separate Bayesian treatments are
offered for the cases of discrete and continuous
dimensions. The resulting algorithm is efficient,
works well in the case of large data bases, and
replicates the full range of empirical literature in
human categorization.

A rational analysis (Anderson, 1990) is an attempt to
specify a theory of some cognitive domain by
specifying the goal of the domain, the statistical
structure of the environment in which that goal is
being achieved, and whatever computational
constraints the system is operating under. The
predictions about the behavior of the system can be
derived assuming that the system will maximize the
goals it expects to achieve while minimizing
expected costs, where expectation is defined with
respect to the statistical structure of the environment.
This approach is different from most approaches in
cognitive psychology because it tries to derive a
theory from assumptions about the structure of the
environment rather than assumptions about the
structure of the mind.

We  have  applied  this  approach  to  human 
categorization  and  have  developed  a  rather  effective 
algorithm  for  categorization.  The  analysis  assumes 
that  the  goal  of  categorization  is  to  maximize  the 
accuracy  of  predictions  about  features  of  new  objects. 
For  instance,  one  might  want  to  predict  whether  an 
object  is  dangerous  or  not.  This  approach  to 
categorization  sees  nothing  special  about  category 
labels. The fact that an object might be called a tiger is
just  another  feature  one  might  want  to  predict  about 
the  object. 


The  Structure  of  the  Environment 

It is an interesting question what kind of structure we
can assume of the environment in order to drive
prediction. The theory developed rested on the
structure of biological categories produced by the
phenomenon of species. Species form a nearly
disjoint partitioning of the natural objects because of
the inability to interbreed. Within a species there is a
common genetic pool which means that individual
members of the species will display particular feature
values with probabilities that reflect the proportion of
that phenotype in the population. Another useful
feature of species structure is that the display of
features within a freely-interbreeding species is
largely independent. Thus, there is little relationship
between size and eye color in species where those
two dimensions vary. Thus, the critical aspects of
speciation are the disjoint partitioning of the object set
and the independent probabilistic display of features
within a species.

An  interesting  question  is  whether  other  types  of 
objects  display  these  same  properties.  Another 
common  type  of  object  is  the  artifact.  Artifacts 
approximate  a  disjoint  partitioning  but  there  are 
occasional  exceptions--for  instance,  mobile  homes 
which  are  both  homes  and  vehicles.  Other  types  of 
objects  (stones,  geological  formations,  heavenly 
bodies,  etc)  seem  to  approximate  a  disjoint 
partitioning  but  here  it  is  hard  to  know  whether  this  is 
just  a  matter  of  our  perceptions  or  whether  there  is 
any  objective  sense  in  which  they  do.  One  can  use 
the  understanding  of  speciation  for  natural  kinds  and 
understanding  of  the  intended  function  in 
manufacture  for  artifacts  to  objectively  assess  the 
hypothesis  of  a  disjoint  partitioning. 

We have taken this disjoint, probabilistic model of
categories and used it as the understanding of the
structure of the environment for doing prediction
about object features. To maximize the prediction of
features of objects we need to induce a disjoint
partitioning of the object set into categories and
determine what the probability of features will be for
each category. The ideal prediction function would
be described by the following formula:

$$\mathrm{Pred}_{ij} = \sum_x P(x \mid F_n)\,\mathrm{Prob}_i(j \mid x)$$

where $\mathrm{Pred}_{ij}$ is the probability an object will display a
value j on a dimension i which is not observed for
that object, the summation is across all possible
partitionings x of the n objects seen into disjoint sets,
$P(x \mid F_n)$ is the probability of partitioning x given the
objects display observed feature structure $F_n$, and
$\mathrm{Prob}_i(j \mid x)$ is the probability the object in question
would display value j on dimension i if x were the
partition. The problem with this approach is that the
number of partitions of n objects grows exponentially
as the Bell exponential number (Berge, 1971).
Assuming that humans cannot consider an
exponentially exploding number of hypotheses we
were motivated to explore iterative algorithms such
as those developed by Fisher (1987) and Lebowitz
(1987).
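
To give a feel for how quickly the number of partitionings grows, the following minimal Python sketch tabulates Bell numbers with the Bell triangle recurrence. It is an illustration only; the function name `bell_numbers` is our own and is not part of the algorithm described here.

```python
def bell_numbers(count):
    """Return the Bell numbers B(0), ..., B(count - 1) using the Bell triangle."""
    bells = []
    row = [1]
    for _ in range(count):
        # The first entry of each triangle row is the next Bell number.
        bells.append(row[0])
        # The next row starts with the last entry of the current row; each
        # further entry adds the corresponding entry of the current row.
        next_row = [row[-1]]
        for value in row:
            next_row.append(next_row[-1] + value)
        row = next_row
    return bells


if __name__ == "__main__":
    for n, b in enumerate(bell_numbers(11)):
        print(n, b)
```

With ten objects there are already 115,975 possible partitionings, so a sum over all partitionings quickly becomes impractical.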

The following is a formal specification of the
iterative algorithm:

1. Before seeing any objects, the category
partitioning of the objects is initialized
to be the empty set of no categories.

2. Given a partitioning for the first m
objects, calculate for each category k
the probability $P_k$ that the m+1st object
comes from category k. Let $P_0$ be the
probability that the object comes from a
completely new category.

3. Create a partitioning of the m+1 objects
with the m+1st object assigned to the
category with maximum probability.

4. To estimate the probability of value j on
dimension i for the n+1st object
calculate

$$\mathrm{Pred}_{ij} = \sum_k P_k\,P(ij \mid k) \qquad \text{Equation 1}$$

where $P_k$ is the probability the n+1st object comes
from category k and $P(ij \mid k)$ is the probability of
displaying value j on dimension i.

The basic algorithm is one in which the category
structure is grown by assigning each incoming object
to the category it is most likely to come from. Thus,
a specific partitioning of the objects is produced.
Note, however, that the prediction for the new n+1st
object is not calculated by determining its most likely
category and the probability of j given that category.
This calculation is performed over all categories.
This gives a much more accurate approximation to
the ideal $\mathrm{Pred}_{ij}$ because it handles situations where
the new object is ambiguous between multiple
categories. It will weight approximately equally
these competing categories.
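
The difference between weighting all categories and committing to the single most likely one can be seen in a small sketch. The probabilities below are hypothetical and the function names are our own; the first function is the weighted sum of Equation 1, the second is the alternative that the text argues against.

```python
def predict_feature(category_posteriors, feature_given_category):
    """Equation 1: sum over categories k of P_k * P(ij|k)."""
    return sum(p_k * p_ij_k
               for p_k, p_ij_k in zip(category_posteriors, feature_given_category))


def predict_from_best_category(category_posteriors, feature_given_category):
    """Use only the single most probable category (not what Equation 1 does)."""
    best = max(range(len(category_posteriors)),
               key=category_posteriors.__getitem__)
    return feature_given_category[best]


if __name__ == "__main__":
    # Hypothetical object that is ambiguous between two categories.
    posteriors = [0.51, 0.49]      # P_k for k = 1, 2
    feature_probs = [0.9, 0.1]     # P(ij|k) for k = 1, 2
    print(predict_feature(posteriors, feature_probs))
    print(predict_from_best_category(posteriors, feature_probs))
```

For this ambiguous object the weighted prediction stays close to one half, whereas committing to the slightly more probable category would predict the feature with probability 0.9.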

The algorithm is not guaranteed to produce the
maximally probable partitioning of the object set
since it only considers partitionings that can be
incrementally grown. It also does not weight
multiple possible partitionings as the ideal algorithm
would. In cases of strong category structure, there
will be only one probable partitioning and the
iterative algorithm will uncover it. In cases of weak
category structure, it will often fail to obtain the ideal
partitioning, but still the predictions obtained by
Equation 1 closely approximate the ideal quantity
because of the weighting of multiple categories. We
observe correlations of about .95 between the
predictions of our algorithm and the ideal quantities
in cases of small data sets.

It remains to come up with a formula for
calculating $P_k$ and $P(ij \mid k)$. Since $P(ij \mid k)$ proves to be
involved in the definition of $P_k$, we will focus on $P_k$.
In Bayesian terminology $P_k$ is a posterior probability
$P(k \mid F)$ that the object belongs to category k given that
it has feature structure F. Bayes formula can be used
to express this in terms of a prior probability $P(k)$ of
coming from category k before the feature structure is
inspected and a conditional probability $P(F \mid k)$ of
displaying the feature structure F given that it comes
from category k.

$$P_k = P(k \mid F) = \frac{P(k)\,P(F \mid k)}{\sum_k P(k)\,P(F \mid k)} \qquad \text{Equation 2}$$

where the summation in the denominator is over all
categories k currently in the partitioning including the
potential new one. This then focuses our analysis on
the derivation of a prior probability $P(k)$ and a
conditional probability $P(F \mid k)$.

Prior  Probability 

With respect to prior probabilities the critical
assumption is that there is a fixed probability c that
any two objects come from the same category and
this probability does not depend on the number of
objects seen so far. This is called the coupling
probability. If one takes this assumption about the
coupling probability between two objects being
independent of the other objects and generalizes it,
one can derive a simple form for $P(k)$ (see Anderson,
1990, for the derivation):

$$P(k) = \frac{c\,n_k}{(1-c) + c\,n} \qquad \text{Equation 3}$$

where c is the coupling probability, $n_k$ is the number
of objects assigned to category k so far, and n is the
total number of objects seen so far. Note for large n
this closely approximates $n_k/n$ which means that we
have a strong base rate effect in these calculations
with a bias to put new objects into large categories.
Presumably the rational basis for this is apparent.

We also need a formula for $P(0)$ which is the
probability that the new object comes from an
entirely new category. This is

$$P(0) = \frac{1-c}{(1-c) + c\,n} \qquad \text{Equation 4}$$

For large n this closely approximates $(1-c)/cn$
which is again a reasonable form, i.e., the probability
of a brand new category depends on the coupling
probability and number of objects seen. The greater
the coupling probability and the more objects, the
less likely it is that the new object comes from an
entirely new category.
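
The two prior formulas can be checked numerically. The sketch below assumes the forms P(k) = c n_k / ((1-c) + c n) and P(0) = (1-c) / ((1-c) + c n); the function names and the example partitioning are our own. It confirms that the priors over the existing categories and the new category sum to one.

```python
def prior_existing(n_k, n, c):
    """Prior that the next object joins a category holding n_k of the n objects seen."""
    return c * n_k / ((1 - c) + c * n)


def prior_new(n, c):
    """Prior that the next object starts an entirely new category."""
    return (1 - c) / ((1 - c) + c * n)


if __name__ == "__main__":
    c = 0.3                 # coupling probability
    sizes = [3, 1]          # hypothetical partitioning of n = 4 objects
    n = sum(sizes)
    priors = [prior_existing(n_k, n, c) for n_k in sizes] + [prior_new(n, c)]
    print(priors, sum(priors))
```

Before any object has been seen (n = 0) the new-category prior is 1, so the first object always founds the first category.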

Conditional  Probability 

We can consider the probability of displaying
features on various dimensions given category
membership to be independent of the probabilities on
other dimensions. Then we can write

$$P(F \mid k) = \prod_i P(ij \mid k) \qquad \text{Equation 5}$$

where $P(ij \mid k)$ is the probability of displaying value
j on dimension i given that one comes from category
k.

This  independence  assumption  does  not  prevent  us 
from  recognizing  categories  with  correlated  features. 
Thus,  we  may  know  that  being  black  and  retrieving 
sticks  are  features  found  together  in  labradors.  This 
would  be  represented  by  high  probabilities  of  the 
stick-retrieving  and  the  black  features  in  the  labrador 
category.  What  the  independence  assumption 
prevents  us  from  doing  is  representing  categories 
where  values  on  two  dimensions  are  either  both  one 
way  or  both  the  opposite.  Thus,  it  would  prevent  us 
from  recognizing  a  single  category  of  animals  which 
were  either  large  and  fierce  or  small  and  gentle,  for 
instance.  However,  this  turns  out  not  to  be  a  very 
serious limitation. What our algorithm does in this
case is to spawn a different category to capture each
two-feature combination: it would create a category
of large and fierce creatures and another category of
small and gentle creatures.

The  effect  of  Equation  (5)  is  to  focus  us  down  on 
an  analysis  of  the  individual  P(ijlk).  Derivation  of 
this  quantity  is  itself  an  exercise  in  Bayesian  analysis. 
We will treat separately discrete and continuous
dimensions.

Discrete Dimensions

The basic Bayesian strategy for doing inference along
a dimension is to assume a prior distribution of
values along the dimension, determine the
conditional probability of the data under various
possible values of the priors, and then calculate a
posterior distribution of possible values. The
common practice is to start with a rather weak
distribution of possible priors and as more and more
data accumulates come up with a tighter and tighter
posterior distribution.

In the case of a discrete dimension, the typical
Bayesian analysis (Berger, 1985) is to assume that
the prior distribution is a Dirichlet density. For a
dimension with m values a Dirichlet distribution is
characterized by m parameters $\alpha_j$. We can define
$\alpha_0 = \sum_j \alpha_j$. The mean probability of the jth value is
$p_j = \alpha_j/\alpha_0$. The value $\alpha_0$ reflects the strength of belief
in these prior probabilities, $p_j$. The data after n
observations will consist of a set of $C_j$ counts of
observations of value j on dimension i. The posterior
distribution of probabilities is also a Dirichlet
distribution but with parameters $\alpha_j + C_j$. This implies
that the mean expected value of displaying value j in
dimension i is $(\alpha_j + C_j)/\sum_j (\alpha_j + C_j)$. This is $P(ij \mid k)$ for
Equation 5:

$$P(ij \mid k) = \frac{C_j + \alpha_j}{n_k + \alpha_0} \qquad \text{Equation 6}$$

where $n_k$ is the number of objects in category k which
have a value on dimension i and $C_j$ is the number of
objects in category k with the same value as the
object to be classified. For large $n_k$ this approximates
$C_j/n_k$ which one frequently sees promoted as the
rational probability. However, it has to have this
more complicated form to deal with problems of
small samples. For instance, if one has just seen one
object in a category and it has had the color red, one
would not want to guess that all objects are red. If
we assume there are seven colors and all the $\alpha_j$ were
1, the above formula would give 1/4 as the posterior
probability of red and 1/8 for the other six colors
unseen as yet.
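
The color example can be reproduced directly. The sketch below assumes Equation 6 has the form (C_j + alpha_j) / (n_k + alpha_0), with alpha_0 the sum of the alpha_j; the function name `discrete_conditional` is our own. Exact fractions are used so the values 1/4 and 1/8 appear without rounding.

```python
from fractions import Fraction


def discrete_conditional(count_j, n_k, alpha_j, alpha_0):
    """Posterior mean probability of value j: (C_j + alpha_j) / (n_k + alpha_0)."""
    return Fraction(count_j + alpha_j, n_k + alpha_0)


if __name__ == "__main__":
    num_colors = 7
    alpha_j = 1
    alpha_0 = num_colors * alpha_j   # sum of the seven alpha_j
    n_k = 1                          # one object seen in the category, and it was red
    p_red = discrete_conditional(1, n_k, alpha_j, alpha_0)
    p_other = discrete_conditional(0, n_k, alpha_j, alpha_0)
    print(p_red, p_other, p_red + 6 * p_other)
```

The seven probabilities sum to one, as they should for a posterior mean of a Dirichlet distribution.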

Continuous  Dimensions 

Application  of  Bayesian  inference  schemes  to 
continuous  dimensions  is  more  problematic  but  there 
is  one  approach  that  appears  most  tractable  (Lee, 
1989).  The  natural  assumption  is  that  the  variable  is 
distributed  normally  and  the  induction  problem  is  to 
infer  the  mean  and  variance  of  that  distribution.  In 
standard Bayesian inference methodology we must
begin with some prior assumptions about what the
mean and variance of this distribution are. It is
unreasonable to suppose we can know in advance
precisely what either the mean or the variance
will be. Our prior knowledge must take the form of
probability  densities  over  possible  means  and 
variances.  This  is  basically  the  same  idea  as  in  the 
discrete  case  where  we  had  a  Dirichlet  distribution 
giving  priors  about  probabilities  of  various  values. 
The major complication is the need to state separately
prior distributions for mean and variance.

The tractable suggestion for the prior distributions
is that the inverse of the variance $\sigma^2$ is distributed
according to a chi-square distribution and the mean
has a normal distribution. Given these priors, the
posterior distribution of values x on a continuous
dimension i for category k, after n observations has
the following t distribution:

$$f_i(x \mid k) \sim t_{a_i}\!\left(\mu_i,\ \sigma_i\sqrt{1 + 1/\lambda_i}\right) \qquad \text{Equation 7}$$

The parameters $a_i$, $\mu_i$, $\sigma_i$, and $\lambda_i$ are defined as
follows:

$$\lambda_i = \lambda_0 + n \qquad \text{Equation 8}$$

$$a_i = a_0 + n \qquad \text{Equation 9}$$

$$\mu_i = \frac{\lambda_0\mu_0 + n\bar{x}}{\lambda_0 + n} \qquad \text{Equation 10}$$

$$\sigma_i^2 = \frac{a_0\sigma_0^2 + (n-1)s^2 + \dfrac{\lambda_0 n}{\lambda_0 + n}(\bar{x} - \mu_0)^2}{a_0 + n} \qquad \text{Equation 11}$$

where $\bar{x}$ is the mean of the n observations and $s^2$ is
their variance. These equations basically provide us
with a formula for merging the prior mean and
variance, $\mu_0$ and $\sigma_0^2$, with the empirical mean and
variance, $\bar{x}$ and $s^2$, in a manner that is weighted by our
confidences in these priors, $\lambda_0$ and $a_0$.

Equation  7  for  the  continuous  case  describes  a 
probability  density  which  serves  the  same  role  as 
Equation  6  for  the  discrete  case  which  describes  a 
probability.  The  product  of  conditional  probabilities 
in  Equation  5  will  then  be  a  product  of  probabilities 
and  density  values.  Basically,  Equations  (5),  (6),  and 
(7)  give  us  a  basis  for  judging  how  similar  an  object 
is  to  the  category’s  central  tendency. 
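
The continuous case can be sketched in the same spirit. The code below is a minimal illustration under stated assumptions: that Equations 8 to 11 update the confidences, mean and variance as written in the comments, and that Equation 7 is the usual Student t predictive density with a_i degrees of freedom, centre mu_i and scale sigma_i * sqrt(1 + 1/lambda_i). All function names are our own, and lambda_0 and a_0 are assumed to be positive.

```python
import math


def posterior_parameters(observations, lambda_0, a_0, mu_0, sigma_0_sq):
    """Merge prior (mu_0, sigma_0_sq) with data, weighted by lambda_0 and a_0."""
    n = len(observations)
    mean = sum(observations) / n if n else 0.0
    # (n - 1) * s^2 equals the sum of squared deviations from the sample mean.
    sum_sq_dev = sum((x - mean) ** 2 for x in observations)
    lambda_i = lambda_0 + n                                   # Equation 8
    a_i = a_0 + n                                             # Equation 9
    mu_i = (lambda_0 * mu_0 + n * mean) / (lambda_0 + n)      # Equation 10
    shift = (lambda_0 * n / (lambda_0 + n)) * (mean - mu_0) ** 2
    sigma_i_sq = (a_0 * sigma_0_sq + sum_sq_dev + shift) / (a_0 + n)  # Equation 11
    return lambda_i, a_i, mu_i, sigma_i_sq


def t_density(x, dof, location, scale):
    """Density of a Student t distribution with the given location and scale."""
    z = (x - location) / scale
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi) - math.log(scale))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(z * z / dof))


def continuous_conditional(x, observations, lambda_0, a_0, mu_0, sigma_0_sq):
    """Density of value x on one continuous dimension of a category (assumed Equation 7)."""
    lambda_i, a_i, mu_i, sigma_i_sq = posterior_parameters(
        observations, lambda_0, a_0, mu_0, sigma_0_sq)
    scale = math.sqrt(sigma_i_sq * (1 + 1 / lambda_i))
    return t_density(x, a_i, mu_i, scale)
```

With no observations the posterior parameters reduce to the priors, and as n grows the estimates are dominated by the empirical mean and variance.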

Conclusion 

This completes our specification of the theory of
categorization. Before looking at its application to
phenomena, a word of caution is in
order. The claim is not that the human mind
performs any of the Bayesian mathematics that fills
the preceding pages. Rather the claim of the rational
analysis is that, whatever the mind does, its output
must be optimal. The mathematical analyses of the
preceding pages serve the function of allowing us, as
theorists, to determine what is optimal.


A  second  comment  is  in  order  concerning  the 
output  of  the  rational  analysis.  It  delivers  a 
probability  that  an  object  will  display  a  particular 
feature.  There  remains  the  issue  of  how  this  relates 
to  behavior.  Our  basic  assumption  will  only  be  that 
there  is  a  monotonic  relationship  between  these 
probabilities  and  behavioral  measures  such  as 
response  probability,  response  latency,  and 
confidence of response. The exact mapping will
depend on such things as the subject’s utilities for
various possible outcomes, the degree to which
individual subjects share the same priors and
experiences, and the computational costs of achieving
various possible mappings from rational probability
to behavior. These are all issues for future
exploration.  What  is  remarkable  is  how  well  we  can 
fit  the  data  simply  assuming  a  monotonic 
relationship. 

Application  of  the  Algorithm 

We  have  applied  the  algorithm  to  a  number  of 
examples  to  illustrate  its  properties.  The  algorithm  is 
quite  efficient.  A  Franz  LISP  implementation 
categorized  the  290  items  from  Michalski  and 
Chilausky’s  data  set  on  Soybean  disease  (each  with 
36  values)  in  1  CPU  minute  on  a  Vax  780  or  a  MAC 
II.  This  is  without  any  special  effort  to  optimize  the 
code.  It  also  diagnosed  the  test  set  of  340  soybean 
instances  with  as  much  accuracy  as  apparently  did 
the  original  system  of  Michalski  and  Chilausky. 
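
For readers who want to experiment, the iterative algorithm for discrete dimensions can be sketched in a few lines. This is not the Franz LISP implementation mentioned above but a minimal Python illustration under stated assumptions: objects are tuples of discrete values, every alpha_j is equal to a single `alpha`, and the names `categorize` and `num_values` are our own. Because the denominator of Equation 2 is common to all categories, the sketch compares the unnormalized products of prior and conditional probability.

```python
def categorize(objects, num_values, c=0.3, alpha=1.0):
    """Assign each object, in order, to its most probable category.

    objects: sequence of tuples of discrete feature values.
    num_values[i]: number of possible values on dimension i.
    Returns a list of categories, each a list of the objects assigned to it.
    """
    categories = []
    for n, obj in enumerate(objects):            # n objects seen so far
        denom = (1 - c) + c * n
        scores = []
        for members in categories:
            prior = c * len(members) / denom     # prior for an existing category
            likelihood = 1.0
            for i, value in enumerate(obj):      # product over dimensions
                count = sum(1 for m in members if m[i] == value)
                likelihood *= (count + alpha) / (len(members) + alpha * num_values[i])
            scores.append(prior * likelihood)
        # Candidate new category: no members, so each dimension contributes 1/m_i.
        new_likelihood = 1.0
        for m_i in num_values:
            new_likelihood *= 1.0 / m_i
        scores.append((1 - c) / denom * new_likelihood)
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == len(categories):
            categories.append([obj])
        else:
            categories[best].append(obj)
    return categories


if __name__ == "__main__":
    a, b = (0, 0, 0, 0), (1, 1, 1, 1)
    print(categorize([a, a, b, b], num_values=(2, 2, 2, 2)))
```

The default values of `c` and `alpha` follow the parameter settings reported for the simulations; each new object is compared against every existing category once, so the cost of adding an object grows with the number of categories and dimensions rather than with the number of possible partitionings.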

The  algorithm  has  been  applied  to  the  full  range  of 
psychological  experiments  in  categorization. 
Detailed  discussions  can  be  found  in  Anderson  (in 
press)  and  Anderson  &  Matessa  (in  preparation). 
However,  we  will  review  here  in  varying  detail  the 
applications of the algorithm to 10 empirical
phenomena. All these simulations were done with a
constant setting of the parameters: c from Equations 3
and 4 at .3, $\alpha_j$ from Equation 6 at 1, $\lambda_0$ from Equation
8 at 1, $a_0$ from Equation 9 at 1, $\mu_0$ from Equation 10
at the mean of the stimuli, and $\sigma_0^2$ from Equation 11
at the square of 1/4 the stimulus range. All of these
are plausible settings and often correspond to
conventions for setting Bayesian non-informative
priors. The following are among the empirical
phenomena we have successfully simulated:

1. Extraction of Central Tendencies, Continuous
Dimensions. The Bayesian model for continuous
dimensions implies that categorization should vary
with distance from central tendency. This enables the
model to simulate the data of Posner & Keele (1968)
on categorization of dot patterns and Reed (1972) on
categorization of faces. Let us consider the
experiment of Reed in a little detail:

Reed (1972) had subjects learn to categorize the 10
faces which are illustrated in Figure 1. The first row
of faces are in one category and the second row of
faces are in another category. The two sets of faces
are deviations from underlying prototypes. After
studying these faces subjects went to a test condition
where they had to try to classify these and other
faces. The critical data concern the probability with
which subjects assigned faces to conditions. As a
general characterization, their categorization varied
with distance of the face from the prototype.

Figure  1 


In our attempt to simulate these data we treated
these faces as five-dimensional stimuli where the
dimensions are height of the forehead which ranged
from 54 to 88 mm, distance separation of the eyes
which ranged from 20 to 55 mm, length of the nose
which ranged from 32 to 64 mm, height of the mouth
which ranged from 28 to 60 mm, and category label
which was a binary-valued discrete dimension. Our
rational model identified two or more internal
categories, depending on presentation order, that
corresponded to the experimenter’s categories. That
is, sometimes it subdivided the experimenter’s
categories into subcategories but it almost never
merged items from the two experimenter categories
into an internal category. Reed’s subjects were asked
to classify 25 test stimuli and the major test of our
model was its classification of these test stimuli.
Overall its confidence of category membership
(calculated by Equation 1) correlated .90 with Reed’s
data.¹

2. Extraction of Central Tendencies, Discrete
Dimensions. The model implies that stimuli should be
better categorized if they display the majority value
for a dimension. This enabled the model to simulate
the data of Hayes-Roth & Hayes-Roth (1977), for
instance.


3. Effect of Individual Instances. If an instance is
sufficiently different from the central tendency for its
assigned category, the model will form a distinct
category for it. This enables the model to account for
the data of Medin & Schaffer (1978) on discrete
dimensions and Nosofsky (1988) on continuous
dimensions. Let us consider the experiment of
Nosofsky:

Nosofsky trained his subjects on 12 stimuli that
varied in brightness and saturation. The colors varied
in brightness on the Munsell scale from 3 to 7 and in
saturation from 4 to 12. In the base condition
subjects had four trials on each item and were then
tested. In the first experiment there was a condition
E2 in which subjects saw stimulus 2 approximately 5
times as frequently and a condition E7 in which they
saw stimulus 7 approximately 5 times as frequently.
Part (a) of Figure 2 illustrates the probability of
classification in Category 2. As can be seen subjects
are sensitive to the frequency manipulation. Part (b)
of Figure 2 shows the probability our model assigned
to a Category 2 response given the same experience.
The overall correlation between data and theory is
.98.

Figure 2

¹We would like to thank Stephen Reed for making his data
available.


4. Linearly Separable versus Non-Linearly
Separable Categories. Unlike some categorization
models this model is able to learn categories that
cannot be separated by a plane in an n-dimensional
hyperspace. This is because it can form multiple
internal categories to correspond to an experimenter’s
category. This enables the model to account for the
data of Medin & Schwanenflugel (1981) on discrete
dimensions and Nosofsky, Clark, & Shin (1989) on
continuous dimensions. Let us consider the
experiment of Medin & Schwanenflugel. They
performed an experiment where linearly non-separable
categories were learned better than linearly
separable categories.

Table 1 illustrates the material used by Medin &
Schwanenflugel (1981). In the case of the linearly
separable categories our model formed separate
categories for each stimulus. In the linearly
non-separable case, it merged the first two stimuli in
category A into one internal category, the second two
in category A into another, and the first, second, and
fourth in category B into a third. Thus, only
stimulus 3 in category B was in a singleton category
and this was the stimulus that produced the highest
error rate in the non-separable condition.

Table 1

CATEGORY A
            DIMENSION
EXEMPLAR    D1  D2  D3  D4
A1           1   1   1   0
A2           1   0   1   1
A3           1   1   0   1
A4           0   1   1   1

CATEGORY B
            DIMENSION
EXEMPLAR    D1  D2  D3  D4
B1           1   0   1   0
B2           0   1   1   0
B3           0   0   0   1
B4           1   1   0   0

5. Basic-Level Categories The internal categories
that the model extracts correspond to what Rosch
(1976) meant by basic-level categories.² Thus, it can
simulate the data of Murphy & Smith (1982) and
Hoffman & Ziessler (1983). We will describe the
experiment of Murphy and Smith, who presented to
their subjects 16


²Rosch's idea of a basic level is that there is a level in the
generalization hierarchy to which we first assign objects. For
instance, she argues we would first see an object as a bird, not a
sparrow or an animal.


objects  identified  as  examples  of  fictitious  tools.  The 
structure of the material, as encoded by Gluck and
Corter (1985), is illustrated in Table 2. There were
two  superordinate  categories  which  divided  into  4 
intermediate  categories,  which  divided  into  8 
subordinate  categories.  Table  2  gives  the  attribute 
description  of  each  category.  Subjects  were  fastest  to 
classify  the  material  at  the  intermediate  level  which 
Murphy  and  Smith  intended  to  be  the  basic  level. 
Objects  at  this  level  had  two  attributes  plus  two  labels 
in  common.  Only  one  additional  feature  and  label 
was  gained  at  the  subordinate  level,  and  all  features 
were  lost  at  the  superordinate  level  except  for  their 
feature  of  being  a  pounder  or  a  cutter. 

Table 2

Gluck and Corter's Analysis of the Feature Structure of the Material from
Murphy & Smith (1982)

        Categories                                  Attributes
Item    Superordinate  Intermediate  Subordinate    Handle  Shaft  Head  Size
 1.     Pounder        Hammer        Hammer 1         2       2      0     0
 2.                                                   2       2      0     1
 3.                                  Hammer 2         2       2      1     0
 4.                                                   2       2      1     1
 5.                    Brick         Brick 1          0       3      4     0
 6.                                                   0       3      4     1
 7.                                  Brick 2          1       3      4     0
 8.                                                   1       3      4     1
 9.     Cutter         Knife         Knife 1          3       4      2     0
10.                                                   3       4      2     1
11.                                  Knife 2          3       4      3     0
12.                                                   3       4      3     1
13.                    Pizza C.      P.C. 1           4       0      3     0
14.                                                   4       0      3     1
15.                                  P.C. 2           4       1      3     0
16.                                                   4       1      3     1


We  modeled  this  material  by  encoding  the  stimuli 
as  7-dimensional  objects  with  dimensions  for  the 
superordinate  label  (2  values),  the  intermediate  label 
(4  values),  the  subordinate  label  (8  values),  handle  (5 
values),  shaft  (5  values),  head  (6  values),  and  size  (2 
values).  What  category  structure  was  obtained 
depended  upon  the  value  of  the  coupling  probability. 
For  c  >  .96  all  were  merged  into  one  category;  for  .95 
>  c  >  .8  the  two  superordinate  categories  emerged; 
for  .8  >  c  >  .4  the  model  fluctuated  between  the 
superordinate  and  intermediate  categories  depending 
on  presentation  order;  for  .4  >  c>  .2  it  extracted  just 
the  intermediate  categories;  for  .2  >  c  >  .05  it 
basically  extracted  the  intermediate  categories  with 
an  occasional  singleton  category  or  subordinate 
category; for c < .05 it extracted only singleton
categories. In summary, the subordinate categories
never emerged and only at very high levels of c did
superordinate categories dominate. At the value of c
used in the simulations of this paper (c = .3) only the
basic level categories emerged. Thus, it seems fair to
conclude  that  the  analysis  agrees  with  the  subjects  as 
to  what  the  basic  level  is. 

6. Probability Matching Faced with truly
probabilistic categories and large samples of
instances the model will estimate feature
probabilities that correspond exactly to the empirical
proportions. Thus, it predicts the data of Gluck &
Bower (1988) on probability matching.

7. Base-Rate Effect Because of Equation 3 this
model  predicts  that  usually  there  will  be  a  greater 
tendency  to  assign  items  to  categories  of  large  size. 
Thus,  it  handles  the  data  of  Homa  &  Cultice  (1984). 
It  also  reproduces  the  more  subtle  interactions  of 
Medin  &  Edelson  (1988). 

8.  Correlated  Features  As  noted  earlier  the  model 
can  handle  categories  with  correlated  features  by 
breaking  out  separate  internal  categories  for  each 
feature combination. Thus, it handles the data of
Medin, Altom, Edelson, & Freko (1982). They had
subjects study the 9 cases in Table 3 which were all
supposed to represent instances from one disease
category, burlosis. This was simulated by presenting
these 9 cases to the model with a sixth dimension, a
disease label which was always burlosis. This was
arbitrarily treated as a binary dimension. Note
that each of the five symptoms shows a majority of
ones associated with the disease.

Table 3

SYMPTOMS OF BURLOSIS
from Medin et al. (1982)

Case        Blood      Skin        Muscle      Condition   Weight
Study       Pressure   Condition   Condition   of Eyes     Condition
1. R.L.        0          1           0           1           1
2. L.F.        1          1           0           1           1
3. J.J.        0          0           1           1           1
4. R.M.        1          0           1           1           1
5. A.M.        1          1           1           1           1
6. J.S.        1          1           1           1           1
7. S.T.        1          0           0           0           0
8. S.E.        0          1           1           0           0
9. E.M.        1          1           1           0           0

Note: Zero denotes absence of the symptom and 1 denotes presence.


The critical feature of these materials from the
perspective of correlated features concerns the fourth
dimension, condition of eyes, and the fifth
dimension, weight. Values are either both 1 or
both 0. The first six items in Table 3 have two 1's;
the last three have two 0's. Subjects are sensitive to
this correlation. When these stimuli were fed into the
algorithm with c = .3, it typically extracted 3
categories: one to represent the first six items, one
for the seventh, and one for the last two. Thus, the
way it dealt with correlated features was to break out
separate categories for the different possible values of
the correlation.

9.  Effects  of  Feedback  If  the  category  structure  of 
the  stimuli  is  strong  enough  the  model  can  extract  the 
categories  without  any  feedback  as  to  category 
identity. In the face of weak category structure, it is
necessary to provide category labels to get learning.
Thus, this model reproduces the data of Homa &
Cultice (1984).

Figure  3  illustrates  the  stimulus  material  of  Homa 
and  Cultice.  They  are  derived  from  the  random  9-dot 
patterns  introduced  by  Posner  &  Keele  (1968)  but 
Homa  has  introduced  the  feature  of  drawing  lines  to 
connect the dots. This makes it relatively cheap to
write a computer program that will determine how to
map the points of one pattern onto another in such a
way as to achieve maximal fit. Given such a mapping,
we can
describe  each  stimulus  according  to  18  ordered 
dimensions  which  are  the  x  and  y  coordinates  of  each 
point.  Then  we  can  apply  our  categorization 
algorithm  to  these  materials. 
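Once the point correspondence is fixed, the encoding itself is mechanical. A minimal sketch, assuming each pattern is a list of nine (x, y) pairs and that `order` (a hypothetical name) holds the already-computed mapping of this pattern's dots onto the dots of a reference pattern:

```python
def encode_pattern(points, order):
    """Flatten a 9-dot pattern into 18 ordered dimensions.

    points: list of nine (x, y) tuples for one stimulus.
    order:  order[i] is the index in `points` of the dot matched to
            reference dot i (assumed to be computed elsewhere).
    """
    if len(points) != 9 or sorted(order) != list(range(9)):
        raise ValueError("expected nine points and a permutation of 0..8")
    encoded = []
    for index in order:
        x, y = points[index]
        encoded.extend([x, y])  # x then y of each matched dot, in order
    return encoded
```

The fitting step that produces `order` is not shown; the text only says that such a program is inexpensive to write because the connecting lines constrain the mapping.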

Figure 3

There are three categories in Figure 3: one
category represented by 9 items, one by 6, and one by
3. In one condition of their experiment, subjects
were given category labels and trained to sort the
stimuli into three categories; in another condition
they were free to sort the stimuli into whatever
categories they wanted. Homa & Cultice were
interested in determining how well subjects did at
recovering the category structure without feedback.
In the case of feedback, Homa & Cultice just
measured accuracy of assignment in a final criterion
test. In the case of no feedback, they tried to discover
some  way  of  assigning  labels  to  the  categories  in  the 
subjects’  sort  that  made  their  categorization  look 
optimal.  It  is  hard  to  know  how  comparable  the  two 
measures  are. 

In  our  case,  when  there  was  feedback,  we 
measured  the  probability  of  a  category  label 
according  to  Equation  1.  When  there  was  no 
feedback,  we  assigned  labels  to  internal  categories  in 
such  a  way  as  to  maximize  probability  of  a  correct 
label assignment when Equation 1 was used. Again
it is unclear how comparable our two measures were.
In our case we corrected our measures for guessing.
We ran a control condition where, rather than letting
the algorithm decide which items go together, we
randomly assigned items to internal categories and
then proceeded to get performance scores in the same
way as when the algorithm did the assignment. Thus,
we got two measures: P, a mean probability of the
correct category label when our algorithm did the
clustering, and G, a mean probability of category
labeling in the control condition when we did the
clustering randomly. Our final measure was (P - G) /
(1 - G), which is a standard correction-for-guessing
formula.
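The correction can be written out directly. In this short sketch the function name and the example numbers are illustrative assumptions; only the formula (P - G) / (1 - G) comes from the text.

```python
def correct_for_guessing(p, g):
    """Rescale a raw score p so that the guessing rate g maps to 0 and a
    perfect score maps to 1: (p - g) / (1 - g)."""
    if not 0.0 <= g < 1.0:
        raise ValueError("guessing rate must be in [0, 1)")
    return (p - g) / (1.0 - g)


# Hypothetical values: raw probability 0.8 against a guessing baseline of 0.5.
corrected = correct_for_guessing(0.8, 0.5)
```

A score equal to the baseline is mapped to zero, so performance no better than random clustering earns no credit.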

Figure 4


Homa and Cultice used a number of different
training sets including a low distortion training set
where the points were perturbed 1.1 units (the
examples in Figure 3 are 1.1 distortions) and a high
distortion set where they were perturbed 4.8 units.
Figure 4 compares the performance of the subjects
and the simulation for high and low distortion
training stimuli in the presence or absence of label
feedback. In the case of Homa & Cultice, we also used
a correction-for-guessing measure but set the
guessing rate to .33 since there were 3 categories.
Both subjects and simulations show approximately
additive effects of the two dimensions. Both the
subjects  and  the  simulation  are  nearly  at  chance  in  the 
presence  of  high  distortion  stimuli  with  no  label 
feedback.  However,  our  model  does  show  greater 
sensitivity  to  feedback. 

10. Effects of Input Order In the presence of
weak category structure, the categories the model
forms are sensitive to presentation order. In this way
we are able to simulate the data of Anderson (1990)
and Elio & Anderson (1984).

Comparisons  to  Cheeseman,  Kelly, 
Self,  Stutz,  Taylor,  &  Freeman  (1988) 

The  Bayesian  character  of  this  classification  model 
raises  the  issue  of  its  relationship  to  the  Autoclass 
model of Cheeseman et al. While it is hard to know
how  significant  the  differences  are,  there  are  a 
number  of  points  of  contrast: 

Algorithm  Rather  than  an  algorithm  that  iteratively 
incorporates  instances  into  an  existing  category 
structure,  Cheeseman  et  al.  use  a  parameter  searching 
program  that  looks  for  the  best  fitting  set  of 
parameters.  Not  enough  information  is  provided  to 
compare  the  two  algorithms  with  respect  to  efficiency 
or  probability  of  identifying  the  optimal  structure. 
Presumably,  Autoclass  is  independent  of  the  order  of 
the  examples. 

Number  of  Classes  Autoclass  has  a  bias  in  favor  of 
fewer classes whereas this bias is settable in the
rational  model  according  to  the  parameter  c. 
Autoclass  does  not  calculate  a  prior  corresponding  to 
the  probabilities  of  various  partitionings. 

Conditional Probabilities It appears Autoclass uses
the same Bayesian model as we do for discrete
dimensions. The treatment of continuous dimensions
is somewhat different although we cannot discern its
exact mathematical basis. The posterior distribution
is a normal distribution which will only be slightly
different from the t-distribution we use. Both
Autoclass and the rational model assume the various
distributions are independent.

Qualitatively, the most striking difference is that
Autoclass derives a probability of an object
belonging to a class whereas the rational model
assigns the object to a specific class. However,
Cheeseman et al. report that in the case of strong
category structure the probability is very high that the
object comes from a single category.


Abstract

Our research adapts incremental conceptual
clustering (or concept formation) to the task of
learning to guide search. We build on earlier research
that uses concept induction techniques to learn search
control, but our approach differs by virtue of its
reliance on probabilistic, hierarchical classification
schemes that increase certain aspects of search
efficiency. The system also includes inductive
strategies of ‘noise tolerance’ that mitigate problems
of control knowledge ‘utility’. A general lesson is that
recently identified search ‘utility’ problems are
synonymous with inductive problems of ‘noise’;
solutions to the problems of the latter type can be
usefully adapted to the former.

1  Introduction 

An  important  objective  of  machine  learning  research 
is  to  improve  the  efficiency  of  search.  This  includes 
the compilation of operator sequences into
macro-operators and the adaptation of object concept
learning methods to guide operator application (Mitchell,
Utgoff, & Banerji, 1983; Langley, 1985). However,
recent research has qualified the naive application of
these techniques: learned search control knowledge
varies in its utility (Minton, 1988). In the worst
case,  learned  knowledge  can  have  a  detrimental  effect 
on  search  since  the  search  to  find  applicable  learned 
knowledge  can  be  more  costly  than  the  search  that  an 
uninformed  system  would  require. 

This paper illustrates that incremental conceptual
clustering or concept formation can organize search
control knowledge for efficient reuse. In particular,
operator choices made during successful searches
are clustered into ‘similarity’ classes that capture the
shared context in which operators were applicable.
Operator selection during later search is guided by
classification of contextual information. However,
reliance on classification is qualified by ‘noise
tolerant’ strategies that demonstrably mitigate the
‘utility’ problem. In fact, a general observation of our
work is that problems of ‘noise’ in inductive concept
learning and problems of ‘utility’ in search control
learning are closely linked in form and in solution.

*This research was supported by NASA Ames grant

2  Search  and  Concept  Induction 

There are two facets to the problem of learning search
control knowledge: generating plausible abstractions
of when operators should be applied and filtering these
abstractions for their utility (Etzioni, 1988). These
facets correspond to similar aspects of concept
learning. This connection has been traditionally
recognized with respect to the generation of plausible
abstractions. Early work on systems such as Sage
(Langley, 1985) and Lex (Mitchell, Utgoff, & Banerji,
1983) applied empirical concept learning techniques to
complete solution traces, thus inducing conditions
under which an operator’s application previously led
to a goal state. Using information-theoretic methods,
Rendell, Seshu, and Tcheng (1987) used PLS1, a
‘utility’ clustering system, to group problems with
similar solution strategies. Purely analytic concept
learning approaches to uncovering search control
knowledge have also been investigated under the
rubric of explanation-based learning (DeJong &
Mooney, 1986; Mitchell, Keller, & Kedar-Cabelli,
1986). These techniques use analytic, typically
deductive strategies to find conditions under which
‘operators’ should apply from a small number of
‘solution’ traces.

This work has focused almost exclusively on the
generation of plausible abstractions. However,
recently there has been the observation that rule
application varies in utility: the degree that rule
application alters the number of steps (i.e.,
subproblems, states) encountered during search.* Rule
application in certain contexts may actually detract
from search efficiency. Several approaches to
eliminating harmful rule application have
been  examined.  For  example,  related  techniques  by 
Hansson  and  Mayer  (in  press)  and  Wefald  and  Rus¬ 
sell  (1989)  assess  whether  significant  information  is 
gleaned  about  eventual  goal  satisfaction  from  further 
state  expansion.  Probability  estimates  used  in  this 
assessment  are  learned  by  analyzing  the  state  space 
expanded  during  search.  With  sufficient  experience 
probability  estimates  terminate  expansion  at  states 
from  which  it  is  unlikely  that  useful  information  can 
be  found  by  further  search  (e.g.,  the  system  may  be 
very  certain  about  eventual  goal  achievement  from  the 
current  state,  thus  diminishing  the  need  for  further 
search  since  it  is  unlikely  to  yield  significant  new  in¬ 
sights  about  eventual  outcomes). 

Recently,  research  in  explanation-based  learning 
has  employed  similarly-intended,  though  differently- 
implemented  methods  for  controlling  search  (Minton, 
1988;  Mooney,  1989).  Research  in  this  area  has  fo¬ 
cused  on  the  efficacy  of  exploiting  learned  rules  versus 
simply  using  the  primitive  operators.  For  example, 
Markovitch  &  Scott  (1989)  learn  probability  estimates 
that subgoals can be satisfied. Learned rules are not
used in proof attempts of subgoals that are not likely
to be successful; only primitive rules are used in these
cases, thus avoiding redundant search.

As  we  have  noted,  work  with  Sage  and  Lex  recog¬ 
nized  the  applicability  of  concept  induction  methods 
to  generate  plausible  search  control  rules.  Similarly, 
we  believe  that  utility  can  be  tested  by  noise-tolerant 
strategies of concept learning. For example, ID3
(Quinlan, 1986) generates decision trees from training
data using a measure of the information transmitted
about class membership (e.g., disease) by each
attribute used to describe objects (e.g., patient case
histories). The values of the most informative attribute
label  arcs  of  the  tree  and  are  used  to  divide  the  train¬ 
ing  set;  the  information-theoretic  measure  is  used  to 
recursively  divide  each  training  subset,  thus  forming  a 
decision  tree.  During  tree  construction,  an  estimate  of 
whether  ‘significant’  information  is  gained  about  class 
membership  is  made  by  a  chi-square  heuristic.  Similar 
to  Wefald  and  Russell  (1989)  and  Hansson  and  Mayer 
(in  press),  tree  expansion  is  terminated  at  nodes  where 
the  divisive  attribute  does  not  transmit  significant  in¬ 
formation  about  class  membership;  at  this  point,  the 
most  common  class  among  the  training  subset  is  used 
to label the appropriate leaf of the decision tree.
Subsequent data is classified by traversing appropriate
paths of the tree to a leaf, where an appropriate class
designation resides. Quinlan and others (Michalski,
1987) have shown that this general strategy eliminates
the use of ‘low utility’ concepts (or portions thereof);
classification accuracy is not adversely affected or is
actually improved by the process.

*Recent work also points out that search control rules
vary in their match cost - a test of a rule’s applicability
(Minton, 1988; Tambe & Rosenbloom, 1989). Our work
thus far concentrates on reducing the states examined
during search, although we will return to issues of match
cost in Section 5.
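The information-theoretic measure used to choose the divisive attribute can be sketched as a generic information-gain computation. This is not Quinlan's code: the data layout (a list of (attribute-dict, class) pairs) and the function names are assumptions, and the chi-square significance test that stops tree expansion is omitted.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attribute):
    """Reduction in class entropy obtained by splitting on one attribute.

    examples: list of (attributes_dict, class_label) pairs.
    """
    labels = [label for _, label in examples]
    subsets = {}
    for attributes, label in examples:
        subsets.setdefault(attributes[attribute], []).append(label)
    remainder = sum(len(s) / len(examples) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def most_informative(examples, attributes):
    """Pick the attribute whose values would label the arcs at this node."""
    return max(attributes, key=lambda a: information_gain(examples, a))
```

Applying `most_informative` recursively to each training subset yields the tree; a significance test on the gain decides when to stop and label a leaf with the most common class.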

In  the  following  two  sections  we  describe  the  appli¬ 
cation  of  empirical  concept  learning  to  the  induction, 
organization,  and  exploitation  of  search  control  rules. 
Like  earlier  research,  we  are  concerned  with  the  gen¬ 
eration  of  plausible  concepts  and  control  rules.  Our 
work  in  this  area  is  distinguished  by  the  use  of  prob¬ 
abilistic  concepts  (i.e.,  control  rules),  a  representation 
scheme  that  allows  partial  matching.  Moreover,  our 
work  is  further  distinguished  by  its  attention  to  the 
connection  between  post-generation  tests  of  utility  in 
traditional  concept  learning  systems  and  search  con¬ 
trol  learning. 

3  The  COBWEB  Concept  Formation 
System 

Empirical  concept  learning  is  typically  concerned  with 
improving  prediction  accuracy.  In  search  control 
learning  this  translates  into  a  concern  for  accurately 
predicting  the  operator  that  should  be  applied  under 
current  conditions;  more  accurate  predictions  result 
in  a  more  directed,  efficient  search.  For  example,  the 
search-intensive  task  of  language  recognition  is  highly 
anticipatory;  a  parser  expects  (i.e.,  predicts)  a  par¬ 
ticular  symbol  next  on  the  input  stream.  If  the  next 
symbol  is  not  as  expected  then  the  parser  has  made 
an  incorrect  prediction;  this  may  actually  reflect  on 
an  incorrect  prediction  that  was  made  several  symbols 
earlier but has only caused a contradiction after
subsequent processing. In the case of the phrase big blue
bugle boy, the subphrase big blue might be a nickname,
big  blue  bugle  might  refer  to  a  large  blue  brass  instru¬ 
ment,  but  the  intent  of  the  phrase  is  that  big,  blue,  and 
bugle  all  modify  the  noun  boy.  Of  the  relevant  pars¬ 
ing  operators,  (PARSE  adjective)  and  (PARSE  noun), 
neither  can  be  predicted  with  certainty  at  intermedi¬ 
ate  points  in  the  phrase.  As  with  any  search-intensive 
system,  the  parser  must  backtrack  to  an  indetermi¬ 
nate  depth  and  try  alternatives  until  contradictions 
are  eliminated. 

3.1  Hierarchy  Generation 

To reduce backtracking we wish to better predict the
likelihood that an operator applies under current
conditions. Our particular approach to improving search
efficiency is through a conceptual clustering system,
COBWEB (Fisher, 1987), a system that incrementally
builds classification trees from objects that are
described by nominal attribute-value pairs. Stored at
each node of the tree are the value distributions of each
attribute over the objects classified under the node. For
example, if 90% of the objects stored under a node, n,
are blue, then the blue feature would be weighted
accordingly: P(Color = blue | n) = 0.9. Each node is a
probabilistic concept (Smith & Medin, 1981); the
classification tree is a probabilistic concept tree.

Each tree level contains sibling classes that
collectively partition the observed objects. COBWEB can
incrementally incorporate a new object into the class
that best matches the object according to category
utility (Gluck & Corter, 1985), a measure that rewards
the formation of object classes that improve ‘prediction
ability’. More formally, the category utility of a
class, $C_k$, is a function of the expected number of
attribute values that can be correctly predicted about
members of the class, $E(\#\mathit{CorrectPredictions} \mid C_k)$.
This expectation can be further formalized in terms of
conditional probabilities that are stored at tree nodes:
$E(\#\mathit{CorrectPredictions} \mid C_k) = \sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2$.
Category utility has some additional complexities
(Gluck & Corter, 1985; Fisher, 1987), but
for our purposes it is sufficient to note that category
utility is a function of:

$$P(C_k) \sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2$$

where $P(C_k)$ is the proportion of the observations to
which the expectation applies; $P(C_k)$ is the
probability that the benefits (i.e., expectations) of a
class will be realized.
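The stored distributions and the expectation built from them can be sketched as follows. This is a minimal illustration, not COBWEB's implementation: the class and method names are assumptions, and objects are taken to be plain dictionaries of nominal attribute values.

```python
from collections import Counter, defaultdict


class ConceptNode:
    """A probabilistic concept: per-attribute value counts over its objects."""

    def __init__(self):
        self.count = 0
        self.value_counts = defaultdict(Counter)  # attribute -> value -> count

    def add(self, obj):
        """Update the stored distributions with one object (a dict)."""
        self.count += 1
        for attribute, value in obj.items():
            self.value_counts[attribute][value] += 1

    def probability(self, attribute, value):
        """P(attribute = value | node)."""
        return self.value_counts[attribute][value] / self.count

    def expected_correct(self):
        """Sum over attributes i and values j of P(A_i = V_ij | C_k)^2."""
        return sum((n / self.count) ** 2
                   for counts in self.value_counts.values()
                   for n in counts.values())


def weighted_score(node, total_observations):
    """P(C_k) times the expected number of correct predictions."""
    return (node.count / total_observations) * node.expected_correct()
```

For a node holding nine blue objects and one red one, `probability("Color", "blue")` is 0.9 and the Color attribute contributes 0.9² + 0.1² to the expectation.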

COBWEB  incrementally  filters  objects  into  appro¬ 
priate  classes  based  on  category  utility.  A  new  ob¬ 
ject  is  evaluated  with  respect  to  a  class  by  tentatively 
placing  the  object  in  the  class;  each  class’s  attribute- 
value  distributions  are  tentatively  updated  to  reflect 
the  values  of  the  new  object.  Probabilities  are  com¬ 
puted  from  the  tentatively  updated  distributions  and 
the  category  utility  of  the  class  is  computed.  The 
class that maximizes category utility after adding the
new  object  is  chosen  to  classify  the  object  and  appro¬ 
priate  distributions  are  updated  permanently.  This 
process  is  recursively  applied  to  the  subtrees  rooted 
at  the  selected  child  until  a  leaf  is  reached.  A  leaf 
is  a  singleton  class  that  represents  a  previously  ob¬ 
served  object.  While  objects  are  predominantly  in¬ 
corporated  with  respect  to  existing  classes,  operators 
also  exist  for  new  node  (class)  creation,  node  com¬ 
bination  (merging),  and  node  division  (splitting).  A 
more  complete  description  of  COBWEB  can  be  found 
in  Fisher  (1987). 
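The evaluation loop for one level of the tree can be sketched as below. It is a simplification under stated assumptions: classes are plain dictionaries, and each tentative placement is scored by summing P(C_k) times the expected correct predictions over all sibling classes, which for a fixed set of classes ranks placements the same way as Gluck and Corter's category utility. Node creation, merging, splitting, and the recursion into subtrees are omitted.

```python
import copy


def expected_correct(cls):
    """Sum over attributes i and values j of P(A_i = V_ij | C_k)^2."""
    return sum((n / cls["count"]) ** 2
               for counts in cls["values"].values()
               for n in counts.values())


def partition_score(classes):
    """Sum over sibling classes of P(C_k) * expected_correct(C_k)."""
    total = sum(c["count"] for c in classes)
    return sum((c["count"] / total) * expected_correct(c) for c in classes)


def add_object(cls, obj):
    """Update a class's attribute-value counts with one object (a dict)."""
    cls["count"] += 1
    for attribute, value in obj.items():
        counts = cls["values"].setdefault(attribute, {})
        counts[value] = counts.get(value, 0) + 1


def best_class(classes, obj):
    """Tentatively place obj in each class; return the index of the
    placement that gives the highest partition score."""
    best_index, best_score = None, float("-inf")
    for index in range(len(classes)):
        trial = copy.deepcopy(classes)   # tentative update only
        add_object(trial[index], obj)
        trial_score = partition_score(trial)
        if trial_score > best_score:
            best_index, best_score = index, trial_score
    return best_index


def incorporate(classes, obj):
    """Permanently add obj to the best class and return its index."""
    index = best_class(classes, obj)
    add_object(classes[index], obj)
    return index
```

Copying the whole partition for each trial keeps the sketch short; an incremental implementation would update and restore counts in place.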

Object  incorporation  is  easily  adapted  to  allow 
object  classification  and  prediction:  category  utility 
guides  an  object  along  a  path  of  nodes  to  a  ‘best’ 
matching  leaf.  If  any  value(s)  are  missing  from  the 
new  observation,  they  may  be  predicted  from  the 
known  values  of  the  leaf.  While  COBWEB  trees  are 
reminiscent  of  decision  trees,  probabilistic  concepts 
are  polythetic  in  that  multiple  attributes  guide  clas¬ 
sification.  If  an  object  has  missing  attribute  values 
then  category  utility  acts  as  a  partial-matching  func¬ 
tion  with  summation  limited  to  probabilities  of  known 
attributes. 


3.2  Hierarchy  Evaluation:  Pruning  and 
Utility 

Cobweb’s  strategy  of  prediction  at  best  matching 
leaves  demonstrably  yields  good  results  in  many  do¬ 
mains,  but  like  early  versions  of  ID3  this  strategy  can 
often diminish prediction accuracy (cf. Section 2).
More  generally,  all  inductive  learning  systems  require 
that  a  domain  exhibit  regularities  (i.e.,  dependencies 
between  attributes)  if  learning  is  to  be  beneficial.  If 
the  learning  system  is  too  persistent  in  trying  to  un¬ 
cover  regularities  where  no  significant  ones  exist,  then 
this  can  result  in  ‘overfitting’.  This  is  also  the  case 
in  learning  to  search;  if  no  or  little  correlation  exists 
between  the  conditions  of  a  state  and  the  eventual 
success  of  an  operator  application,  then  persistence 
in  trying  to  discover  such  a  connection  will  result  in 
overfitting.  In  search  control  tasks,  overfitting  reduces 
the  accuracy  with  which  operator  applications  are  pre¬ 
dicted,  thus  causing  greater  backtracking.  In  both 
concept  learning  and  search  control  learning,  the  util¬ 
ity  of  certain  ‘rules’  is  negligible  or  detrimental. 

A recent version of Cobweb (Fisher, 1989) employs
a past-performance method for disposing of low utility
rules. This method was inspired by Quinlan’s (1987)
reduced error pruning, but the general approach is also
related to Hansson and Mayer’s and Wefald and
Russell’s strategies for terminating search.* In
particular, as Cobweb classifies an object it determines
at each node whether an attribute would be correctly
predicted at the node. To do this, it compares the
object’s actual value along this attribute with the most
common (i.e., most probable) value of the attribute at
the node. If the two values are equal, then Cobweb
would have correctly predicted the attribute’s value if
it had been required to; in this case, a counter is
incremented indicating that a correct prediction would
have been made at this node. In addition, Cobweb
also records whether the attribute would have been
correctly predicted at a descendant of the node. Thus,
each node holds two counts for an attribute: one of
how often a correct prediction would have been made
at the node, and one of how often a correct prediction
would have been made at a descendant of the
node. When a prediction of an attribute is actually
necessary (i.e., the attribute’s value is unknown), then
classification descends to a node at which the unknown
attribute is more accurately predicted relative to its
descendants; the most common value of the attribute
is used as Cobweb’s prediction.
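The two counters and the stopping rule can be sketched as follows. This is one reading of the description above, not Fisher's implementation: the classification path is assumed to be computed elsewhere, "correct at a descendant" is taken to mean correct at any deeper node on that path, and ties are resolved by stopping at the shallower node.

```python
from collections import Counter


class Node:
    """Tree node holding value counts plus the two past-performance counters."""

    def __init__(self, value_counts):
        self.value_counts = value_counts   # attribute -> Counter of values
        self.correct_here = Counter()      # attribute -> correct at this node
        self.correct_below = Counter()     # attribute -> correct at a descendant

    def most_common(self, attribute):
        """Most probable value of the attribute at this node."""
        return self.value_counts[attribute].most_common(1)[0][0]


def record_performance(path, obj):
    """Update counters along a classification path (root first) for a fully
    described object, as if each attribute had needed predicting."""
    for attribute, actual in obj.items():
        hits = [node.most_common(attribute) == actual for node in path]
        for depth, node in enumerate(path):
            if hits[depth]:
                node.correct_here[attribute] += 1
            if any(hits[depth + 1:]):
                node.correct_below[attribute] += 1


def predict(path, attribute):
    """Descend the path and stop at the first node whose own record for the
    attribute is at least as good as that of its descendants."""
    for node in path:
        if node.correct_here[attribute] >= node.correct_below[attribute]:
            return node.most_common(attribute)
    return path[-1].most_common(attribute)
```

Because a leaf has no descendants, its `correct_below` count stays at zero and the descent always terminates there at the latest.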

Our  summary  of  this  past-performance  ‘pruning’ 

^Fisher  (.1989)  also  experimented  with  a  chi-square 
method  of  terminating  classification  that  directly  assesses 
the  significance  of  the  information  gained  by  deeper  classi¬ 
fication.  Gennari,  Langley,  and  Fisher  (1989)  used  a  cut¬ 
off  parameter  to  prune  a  classification  subtree  based  on 
a  user-supplied  threshold  of  required  ‘information  gain’. 
This  method  was  not  sensitive  to  differences  between  at¬ 
tributes  and  relied  entirely  on  the  user  to  specify  ‘signifi¬ 
cant’  gain  thresholds. 
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Figure  1;  A  NPDA  transition  diagram  for  simplified- 
English  parsing. 

strategy  has  been  brief  of  necessity,  but  it  does  not 
add  to  the  asymptotic  cost  of  learning  or  classification 
and  it  considerably  improves  prediction  accuracy  over 
the  initial  strategy  of  always  classifying  an  object  to 
a  leaf.  With  this  in  mind,  we  turn  to  the  application 
of  these  concept  formation  strategies  to  our  primary 
goal: the effective management of search control.

4  An  Example:  Search  Control  in 
Parsing 

Our  objective  is  to  demonstrate  that  Cobweb  (and 
by  implication  other  conceptual  clustering  systems  as 
well)  can  effectively  direct  search  by  predicting  op¬ 
erators  that  are  best  applied  under  ‘current  circum¬ 
stances’.  Consider  a  detailed  parsing  example  sim¬ 
ilar  to  the  one  given  at  the  beginning  of  Section  3. 
Figure 1 shows a nondeterministic (i.e., backtracking)
push-down automaton (NPDA) for recognizing an artificial
but nontrivial language. Arcs have two types
of labels: those preceded by an up or down arrow, and
those that are a lower-case letter. The lower-case letters
denote input symbols that are to be parsed. A
down arrow (↓) followed by a letter denotes a symbol
to be pushed on the stack when the arc is crossed in
parsing. An arc labeled by an up arrow (↑) denotes
that the symbol must reside at the top of the stack,
and that it is popped from the stack. It is assumed that the
stack contains only the symbol S when parsing begins. A sentence
is accepted (i.e., successfully parsed) if there is
at least one way to enter the final state of the machine
(i.e., state 17) with an empty stack and an empty input
stream.

Consider “abdabbdbabdeeeede”, which
is a string accepted by the NPDA. A successful parse
begins by pushing S on the stack in crossing from state
0 to state 1, then parsing “a b d a b”, leaving the system in
state 7. At state 7, a T is pushed on the stack, then
an S, leaving the system in state 1. Next the symbols
“b d b a b d e e e e” are parsed, and then the S and T
are popped off the stack. Next “d e” are parsed and
the final S and S are popped off the stack, leaving the
system in state 17, the final state.

Our  example  illustrates  the  moves  necessary  for  a 
successful  parse,  but  the  NPDA  contains  4  points  of 
nondeterminism: state 4 given an ‘a’, state 7 given a
‘b’,  state  8  given  a  ‘d’,  and  state  9  given  an  ‘e’.  The 
arcs  out  of  state  9  model  precisely  the  dilemma  of 
the  noun/adjective  example  at  the  beginning  of  Sec¬ 
tion  3.  This  nondeterminism  is  the  cause  of  search: 
each  point  of  nondeterminism  must  be  tried  and  may 
result  in  backtracking  if  the  wrong  choice  is  made. 
The  NPDA  is  constructed  so  that  certain  incorrect 
guesses are discovered rather quickly, while others are
not  uncovered  for  several  moves. 

To reduce backtracking, we present information
about successful parses to Cobweb, in the hope that
the system can use this information to guide future
parsing. In particular, after a sentence is successfully
parsed, a complete trace of the choices required from
the start state through the final state is returned; this
trace does not include the choices that were retracted via
backtracking. Each choice is regarded as an operator
to be predicted from decisions that were made previously.
Thus, the complete trace is segmented into ‘objects’
(i.e., windows) of seven attributes, one object
for each choice made in the trace. The values of four of
these attributes are the four choices (operators) that
were made just prior to the current choice; two of these
attributes are the top two stack symbols at the time of
the current choice; finally, the seventh attribute is the
current choice. Thus, a single parse of ten choices will
be segmented into ten distinct objects.® After sentence
recognition, the successful parse is segmented
and the constituent objects are incorporated into a
Cobweb hierarchy. During subsequent parsing, Cobweb
is used as an oracle for predicting the current
choice, given a window of four previous choices and
the top-most stack symbols.
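
The windowing step can be illustrated with a short sketch. The function name `segment_trace`, the use of `None` for attributes that are unavailable early in a parse, and the representation of each stack as a list with its top at the end are all assumptions made for the example; only the attribute layout (four previous choices, two stack symbols, current choice) comes from the text.

```python
from typing import List, Optional, Sequence, Tuple

WINDOW = 4       # previous choices recorded per object (from the text)
STACK_DEPTH = 2  # top stack symbols recorded per object (from the text)

ParseObject = Tuple[Optional[str], ...]


def segment_trace(choices: Sequence[str],
                  stacks: Sequence[Sequence[str]]) -> List[ParseObject]:
    """Turn one successful parse into one seven-attribute object per choice.

    choices[i] is the i-th operator in the trace; stacks[i] is the stack at
    the moment choice i was made, with the top of the stack last.
    """
    if len(choices) != len(stacks):
        raise ValueError("need exactly one stack snapshot per choice")
    objects = []
    for i, choice in enumerate(choices):
        # Up to four preceding choices, left-padded when fewer exist.
        prev = list(choices[max(0, i - WINDOW):i])
        prev = [None] * (WINDOW - len(prev)) + prev
        # Top two stack symbols, topmost first, right-padded when shallow.
        top = list(reversed(stacks[i][-STACK_DEPTH:]))
        top = top + [None] * (STACK_DEPTH - len(top))
        objects.append(tuple(prev) + tuple(top) + (choice,))
    return objects
```

At prediction time the same layout would be built with the seventh attribute left unknown, and the hierarchy would be asked to fill it in.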

To test the savings provided by a Cobweb oracle,
a Cobweb-enhanced parser was trained on successful
parses. Training sentences were generated so as to
usually adhere to two rules, although neither was followed
100% of the time, thus introducing some ‘noise’.
The first rule was that pushing a T or a U (leading
into state 15) was dependent upon the path from
state 1 to state 4. The second regularity was that
roughly 4 consecutive e’s should occur when in state
9. The Cobweb-enhanced parser was trained on five
randomly generated sentences (under the above-mentioned
regularity constraints), which after windowing
constituted a total of one to two hundred training objects.
Four more sentences were randomly generated
to use as additional test sentences.* For comparative
purposes, the number of backtracks required by
a Cobweb-enhanced parser was compared with that of sixteen
hard-coded parsers; recall that the NPDA of Figure 1
has four points of nondeterminism, with two choices
per point. The sixteen hard-coded parsers correspond
to the (2^4 =) 16 possible orderings on these choices.
These orderings roughly correspond to all the possible
ways that an ‘expert’ might order choice preferences
with sufficient knowledge of individual choice frequencies.
All nine (training and test) sentences were run
against the 16 hard-coded machines; Cobweb was
trained on the first 5 of these sentences and, like the
hard-coded machines, was tested against all nine.

®Note that the initial choices - those without four predecessors
- are still represented but without some initial
attributes.

Figure 2: Backtracks required by Cobweb-guided
parser and best of 16 alternative parsers (horizontal axis: sentence number).

Figure 2 compares the number of backtracks required
by the Cobweb-enhanced parser and the number
required by the best hard-coded machine (per sentence)
for each of the nine sentences (note the logarithmic
vertical scale). The Cobweb-enhanced parser
outperforms the minimums of the hard-coded machines
on all sentences except one (sentence #5),
where the number of backtracks was one in each case.
A nonparametric sign test reveals that the Cobweb-enhanced
parser properly minimized backtracks over
the collective minimum of the hard-coded machines
in a statistically significant (α = 0.01) number of
cases (i.e., 8, with 1 tie); even if we could determine
a priori the best hard-coded machine to parse
each sentence, the enhanced parser would still win in
terms of required backtracks. Overall, the Cobweb-enhanced
parser requires 60 backtracks for all nine


*  Originally,  five  additional  test  sentences  were  gener¬ 
ated,  but  one  was  identical  to  a  training  sentence  and  was 
removed  from  consideration. 
Figure 3: Sensitivity to input window and stack symbols.


sentences, while the sum of the hard-coded machine
minimums comes to 378 backtracks. If we compare the
Cobweb-enhanced parser to the individual machines,
then the savings become even more pronounced: the
backtracks required by the individual machines ranged
from 643 to 2626, as compared to 378 for their collective
minimum and 60 for our Cobweb parser.®
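
As a check on the sign test quoted above, the following sketch computes the exact binomial tail probability with ties discarded. The function name is hypothetical, and whether the original analysis used a one-sided or a two-sided test is not stated, so both are offered here as assumptions.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int, two_sided: bool = False) -> float:
    """Exact sign test; ties are assumed to have been dropped beforehand.

    Under the null hypothesis each non-tied comparison is a fair coin flip,
    so the number of wins follows a Binomial(wins + losses, 0.5) distribution.
    """
    n = wins + losses
    if n == 0:
        raise ValueError("at least one non-tied comparison is required")
    k = max(wins, losses) if two_sided else wins
    # Probability of observing k or more successes out of n fair flips.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail) if two_sided else tail


# Eight wins and no losses among the eight non-tied sentences.
p_one_sided = sign_test_p_value(8, 0)
p_two_sided = sign_test_p_value(8, 0, two_sided=True)
```

With eight wins out of eight non-tied sentences the one-sided probability is 1/256, and doubling it for a two-sided test still leaves it below 0.01, so the reported significance level holds under either reading.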

We also investigated the sensitivity of the Cobweb-enhanced
parser to window size and number of stack
symbols that were used during learning. In the results
that we report above, we assumed an input window
size of 5 and access to the top 2 stack symbols.® Figure 3
compares the total number of backtracks over
all sentences for various window/stack sizes and the
sum total of the minimum parse of each sentence for
any of the static machines (Min of 16). For each of

®There are several interesting issues that arise when
we consider using concept formation techniques to guide
language parsing. In particular, the relative performance
merits of and conceptual links between our heuristically-guided
parser and efficient parsers for such things as LR(k)
grammars are of some interest, but are tangential to the
objectives of this paper. Notably, our heuristic parser assumes
that statistical correlations exist between parsing
transitions; if such correlations exist then concept induction
techniques can hope to provide some speedup over
standard parsing methods, regardless of the language class.
If no such correlations exist, then inductive techniques will
provide no speedups over the standard parsing methods for
a language. More generally, the objectives of this paper
are to illustrate links between inductive concept formation
and search control management, and thus we will defer discussion
of parsing-specific issues.

®As we stated earlier, during testing, 4 of the 5 window
symbols  will  correspond  to  the  4  previous  choices,  while 
the  5th  will  correspond  to  the  choice  to  be  predicted. 
Figure 4: Backtracks required with and without noise-tolerant
classification strategy.


the window/stack combinations that we investigated,
the number of backtracks required by the Cobweb
parser is considerably less than the hard-coded minimums.
A second observation is that the performance
of our parser appears to improve as the amount of information
(input and stack symbols) available to guide
classification increases. The exception to this trend
occurs between (0,3) and (0,5), but we believe this to
be an aberration caused by an exceptional sentence.
In fact, the (0,5) condition properly minimized the backtracks
relative to the (0,3) condition for 6 of the 9 sentences,
with 1 tie and 2 cases in which (0,3) minimized backtracks.
Though our experiments are not sufficient to
make statistically justified claims for the relative advantage
of differing (stack, window) conditions at the
α = 0.05 level, we nonetheless believe that more extensive
experiments will confirm the trend within certain
limits.

Finally, the Cobweb results reported above assume
the classification strategy designed to avoid overfitting
that was described in Section 3. To test our contention
that this strategy avoids overfitting and the detrimental
use of low-utility control knowledge, we compared
these results to a parser that made predictions suggested
by best-matching leaves of the learned concept
tree. Figure 4 compares these alternative versions;
classification to a leaf yields less reliable results and is
often much more costly in terms of required backtracking.
The past-performance strategy outperforms classification
to a leaf in 7 of the 9 cases, with 2 ties.
This is significant using the nonparametric sign test
at α = 0.05. This supports our specific claim that
our past-performance strategy gives a good measure
of rule utility. More generally, a promising conjecture
is that strategies designed to enhance noise tolerance
in concept learning may be useful in mitigating utility
problems in search control.


5  Concluding  Remarks 

We  have  demonstrated  the  efficacy  of  concept  forma¬ 
tion  in  the  management  of  search  control  knowledge. 
Cobweb’s  probabilistic  representation  of  search  con¬ 
trol  facilitates  greater  flexibility  in  guiding  search:  in 
addition  to  hard-and-fast  rules,  tendencies  (‘hunches’) 
also  guide  search.  This  characteristic  is  particularly 
important  in  tasks  like  parsing  since  it  may  be  impos¬ 
sible  to  tell  with  certainty  whether  a  particular  oper¬ 
ator  is  applicable  prior  to  its  invocation. 

Although  our  approach  was  tested  on  a  parsing 
task  we  believe  that  it  can  be  adapted  to  other 
search-intensive  tasks,  though  this  may  require  that 
we  overcome  representational  limitations.  In  partic¬ 
ular,  Cobweb  requires  nominal  attribute-value  rep¬ 
resentations.  This  representation  is  sufficient  to  deal 
with  certain  other  domains  that  have  been  examined 
in  search  control  research  such  as  the  8-tile  puzzle 
and  other  games  (Wefald  &  Russell,  1989),  but  more 
complicated  search  tasks  such  as  planning  and  those 
found in EBL research will require relational representations.
In fact, extensions of Cobweb-like strategies
that deal with relational representations are being
investigated (Yoo & Fisher, 1990; Yang, Fisher, &
Franke, in press).

A  second,  but  more  subtle,  representation  issue 
arises  when  we  examine  more  deeply  the  meaning 
of  rule  ‘utility’.  In  particular,  we  have  ignored  the 
‘match-cost’  aspect  of  utility  (Tambe  &  Rosenbloom, 
1989); the Cobweb-enhanced parser uniformly requires
greater execution time because of increased
match  cost.  In  part,  this  is  due  to  ‘uninteresting’ 
factors  such  as  (1)  inefficiencies  of  the  Cobweb  im¬ 
plementation,  which  we  have  not  tailored  to  this  do¬ 
main,  and  (2)  the  exceedingly  low  cost  of  backtrack¬ 
ing  with  an  NPDA.  More  fundamentally  though,  it 
appears  that  match  costs  are  magnified  by  probabilis¬ 
tic  concepts,  which  require  that  we  match  many  at¬ 
tributes  of  a  concept  -  even  those  that  may  be  sta¬ 
tistically  irrelevant  to  class  membership.  Fortunately, 
the  same  noise-tolerance  strategies  that  identify  when 
an  attribute  is  best  predicted  are  also  useful  in  deter¬ 
mining  when  attributes  are  irrelevant  for  classification 
purposes.  Thus,  we  believe  that  these  strategies  sug¬ 
gest  a  promising  path  for  mitigating  the  match-cost 
factor  of  probabilistic-rule  utility  (Gennari,  1989). 

Finally,  we  believe  that  a  notable  contribution  of 
this  work  is  that  it  illustrates  a  link  between  concept 
induction  strategies  for  noise  tolerance  and  search  con¬ 
trol  issues  of  utility;  classification/search  should  ter¬ 
minate  at  points  that  do  not  helpfully  inform  pre¬ 
diction.  In  Section  2  we  alluded  to  a  point  that 
we  more  forcefully  argue  elsewhere  (Fisher  &  Chan, 
in  press;  Yoo  &  Fisher,  1990)  -  that  overfitting  is 
also  at  the  root  of  the  utility  problem  as  it  per¬ 
tains  to  EBL  and  domain  theory  search.  Learned 
rules  may  represent  logically  possible,  though  statis¬ 
tically  idiosyncratic  connections  between  patterns  of 
operational predicates and target consequences. The
application/consideration of such rules may actually
have a detrimental effect on subsequent problem solving
(Mooney, 1989; Markovitch & Scott, 1989). Our
current work seeks to unify inductive issues of noise
and explanation-based issues of utility in the context
of concept formation over problem-solving experience
(Yoo & Fisher, 1990) and plans (Yang, Fisher, &
Franke, in press). Our work tentatively suggests that
EBL mechanisms of selective retention (Markovitch
& Scott, 1989; Minton, 1988) and selective utilization
(Markovitch & Scott, 1989; Mooney, 1989) are
synonymous in spirit to, and better implemented by,
much-studied inductive methods of pruning (Quinlan,
1986; Michalski, 1987) and preference identification
(Fisher, 1989), respectively.
Abstract

This paper uses a minimal representation criterion
to formally define tasks of matching,
classification, and interpretation of objects
represented as graphs, as well as conceptual
clustering of graphs, and supervised learning
of structured concepts represented as probabilistic
graphs. These definitions do not
rely on acceptance thresholds or other user-selectable
parameters. The resulting problems
of combinatorial optimization are approximately
solved by a fast graph matching
heuristic, which is a key element of the described
learning methods. These include forced
learning of graph models and two graph clustering
methods: incremental and agglomerative.
These methods apply to usual directed
graphs, and to recursively defined layered
graphs. The presented methodology has been
applied to real problems of concept learning,
classification and interpretation of nonrigid
shapes.

1  Introduction 

Structural  descriptions  composed  from  parts  and  re¬ 
lations  are  often  used  to  describe  complex  entities. 
They  have  been  particularly  useful  in  computer  vision, 
as  representations  of  shape  of  objects  (Barrow  et  al., 
1972;  Connel  &  Brady,  1985;  Shapiro,  1980).  A  conve¬ 
nient  form  of  a  structural  description  is  an  attributed 
graph,  i.e.  a  graph  with  symbolic  attributes,  or  labels, 
attached  to  its  vertices  and  edges. 

The use of a graph as a representation for data and
concepts, or models, demands formulation of methods
for matching data to concepts, object classification,
and concept learning. A variety of such methods
have been proposed, some based on exact matching,
e.g. Winston (1975), others allowing inexact match by
defining a graph distance, or probability, as in Wong &
You (1985). However, a practical use of many of the


graph  based  methods  seems  to  be  limited  by  compu¬ 
tational  cost  of  graph  matching,  and  lack  of  robust¬ 
ness.  The  latter  is  most  often  caused  by  use  of  con¬ 
trol  parameters,  such  as  acceptance  level,  threshold,  or 
weights,  without  automatic  methods  of  their  selection, 
so  they  may  have  to  be  set  by  the  user  differently  for 
each  application. 

Recently,  we  introduced  a  method  for  supervised 
learning  of  graph  models  (Segen,  1988a),  which  uses  a 
minimal  representation  criterion  (Segen  &  Sanderson, 
1979;  Segen,  1980)  to  define  graph  matching.  This 
method is very practical, since it is free of user-settable
parameters, and fast, due to an efficient graph
matching  technique.  It  has  been  applied  to  recognition 
and  supervised  learning  of  nonrigid  shapes. 

In  this  paper  we  describe  methods  for  graph  cluster¬ 
ing,  based  on  the  same  approach,  and  present  general 
methodology  for  supervised  and  unsupervised  learn¬ 
ing,  classification  and  interpretation,  using  layered 
graph  representation.  The  minimal  representation  cri¬ 
terion  is  used  to  define  these  tasks  formally,  and  to 
guide  their  solutions,  judging  each  step  by  its  data 
compression  ability.  The  presented  methods  have  been 
implemented, and applied to shape data from real
noisy  images.  They  appear  quite  fast  and  robust.  At 
the  end  of  the  paper  we  show  some  examples  from  this 
application. 

2  Layered  graph 

A directed graph is a set of vertices $V$ and a set of
directed edges $E$. Beginning with a directed graph we
recursively define a special case of a hypergraph called
a layered graph. A vertex $v \in V$ will be called a level 0
vertex, or a leaf. Two level 0 vertices connected by an
edge will be called a level 1 vertex. If we assume a set
of directed edges over the set of level 1 vertices, we can
similarly define a level 2 vertex, and generally define a
level $n+1$ vertex as an ordered pair $(v_1, v_2)$, where $v_1$
and $v_2$ are level $n$ vertices. Their order corresponds to
the direction of the edge between $v_1$ and $v_2$. We will
call such a structure a layered graph. A layered graph


94  Segen 


Figure  1:  Layered  graph.  Leaves  are  at  the  bottom 

can  be  represented  by  a  directed  acyclic  graph,  whose 
vertices  correspond  to  the  vertices  of  the  layered  graph, 
and  edges  show  the  hierarchical  dependency  among 
vertices.  Figure  1  shows  an  example  of  a  layered  graph 
used  as  a  shape  description.  Referring  to  vertices  of  a 
layered  graph  we  will  use  terms  parent,  child,  ancestor, 
descendant  and  leaf,  in  the  same  sense  as  for  a  tree, 
except  that  here  a  vertex  can  have  multiple  parents. 

A layered graph has the following properties:

1. Each non-leaf vertex v has exactly two ordered
children, distinguished as left(v) and right(v).

2. For every vertex v, every path between v and a
leaf vertex has the same length. This length is
called the level of v.

3.  Two  vertices  can  have  no  more  than  one  common 
parent. 

In addition, we assume that each vertex v of a layered
graph has a label L(v), which is a symbol from
a  finite  alphabet,  and  there  is  a  separate  alphabet  for 
each  level.  Such  a  graph  is  a  special  case  of  an  at¬ 
tributed  hypergraph:  the  leaves  of  the  layered  graph 
are  vertices  of  a  hypergraph,  while  the  higher  level  ver¬ 
tices  are  the  hyperedges.  Further  we  often  refer  to  the 
layered  graph  simply  as  graph. 
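
The definition and its three properties can be made concrete with a small data structure. This is only a sketch: the class name `Vertex`, the helper names, and the example labels are hypothetical, and vertices are compared by identity so that two distinct leaves may carry the same label.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(eq=False)  # identity-based equality: equal labels do not merge vertices
class Vertex:
    label: str
    left: Optional["Vertex"] = None
    right: Optional["Vertex"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def level(v: Vertex) -> int:
    """Return the level of v, checking properties 1 and 2 on the way down."""
    if v.is_leaf():
        return 0
    if v.left is None or v.right is None:
        raise ValueError("a non-leaf vertex needs exactly two children")
    left_level, right_level = level(v.left), level(v.right)
    if left_level != right_level:
        raise ValueError("children of a vertex must be on the same level")
    return left_level + 1


def parents_are_unique(vertices: Iterable[Vertex]) -> bool:
    """Property 3: no pair of vertices shares more than one parent."""
    seen = set()
    for v in vertices:
        if v.is_leaf():
            continue
        pair = frozenset((id(v.left), id(v.right)))
        if pair in seen:
            return False
        seen.add(pair)
    return True


# Three leaves, two level 1 vertices sharing the leaf b, and one level 2 vertex.
a, b, c = Vertex("line"), Vertex("arc"), Vertex("line")
ab = Vertex("adjacent", a, b)
bc = Vertex("adjacent", b, c)
top = Vertex("corner", ab, bc)
```

Because the leaf `b` is a child of both `ab` and `bc`, the structure is a directed acyclic graph rather than a tree, which matches the remark that a vertex can have multiple parents.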

3  Probabilistic  graph 

A group of layered graphs that are not identical can
be described using a probability model whose outcome
is a layered graph, or a probabilistic layered graph. We
define this model just like a layered graph, except that
each of its vertices is associated with a probability distribution
over a set of labels, rather than a single label.
The probability of finding the label $l$ at a vertex $v$ will
be denoted $p(l|v)$.

If $T$ is a mapping from vertices of a probabilistic
graph $M$ onto vertices of a graph $H$, we can assign to each
mapped vertex of $M$ the label of the corresponding
vertex of $H$. The probability of $H$ given $M$ and
$T$, $P(H|T,M)$, is the probability of this set of assignments.
Assuming independently distributed vertex labels,
$P(H|T,M)$ is simply the product of the probabilities
$p(L(Tv)|v)$, where $v$ is a vertex of $M$, and $Tv$
is the vertex of $H$ assigned to $v$ by $T$.
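
A minimal sketch of this product, computed in log space to avoid underflow, is given below. The dictionary-based representation and all names are assumptions made for illustration; each model vertex maps to a label distribution, and the mapping $T$ is a dictionary from model vertices to graph vertices.

```python
import math
from typing import Dict, Hashable


def log2_probability(model_dists: Dict[Hashable, Dict[str, float]],
                     graph_labels: Dict[Hashable, str],
                     mapping: Dict[Hashable, Hashable]) -> float:
    """Return log2 P(H | T, M) under independent vertex labels.

    model_dists[v][l] plays the role of p(l | v) for a model vertex v,
    graph_labels[u] is L(u) for a graph vertex u, and mapping[v] is Tv.
    """
    total = 0.0
    for v, tv in mapping.items():
        p = model_dists[v].get(graph_labels[tv], 0.0)
        if p == 0.0:
            # The model rules out this label, so the whole product is zero.
            return float("-inf")
        total += math.log2(p)
    return total
```

Working with base-2 logarithms keeps the result directly comparable with the bit costs used in the following sections.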

4  Minimal  representation  criterion 

The minimal representation criterion (Segen &
Sanderson, 1979; Segen, 1980) was introduced to guide
inference of models in cases when the maximum likelihood
fails. Its formulation has been inspired by the
pioneering work of Solomonoff (1964). The minimal
representation criterion seeks, in a given set of programs,
a minimal length program generating observed
data $X$. Mapping a family of probability distributions to
a subset of programs, and seeking a distribution corresponding
to a minimal program, results in an inference
rule that selects a distribution $P$ that minimizes

$$C(P) - \log_2 P(X) \qquad (1)$$

where $C(P)$ is the number of bits needed to specify
the probability model $P$, or the cost of $P$. This criterion
is obviously equivalent to the minimal description
length principle of Rissanen (1978, 1987), and related
to the information measure of Wallace & Boulton
(1968), but it was derived independently.
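
To illustrate how expression (1) trades model cost against fit, the sketch below selects a Bernoulli model for a sequence of coin flips. Everything about the model family is an assumption chosen for the example: a parameter $q = j/2^b$ is taken to cost $b$ bits, which is one simple way of assigning $C(P)$ and is not taken from the paper.

```python
import math
from typing import Tuple


def bernoulli_log2_likelihood(q: float, heads: int, tails: int) -> float:
    """log2 P(X) for a sequence with the given counts under Bernoulli(q)."""
    return heads * math.log2(q) + tails * math.log2(1.0 - q)


def best_model(heads: int, tails: int, max_bits: int = 8) -> Tuple[float, int, float]:
    """Minimise C(P) - log2 P(X) over parameters q = j / 2**b, 0 < q < 1.

    Assumption: stating q to b bits of precision costs C(P) = b bits.
    Returns (total description length in bits, b, q).
    """
    best = None
    for b in range(1, max_bits + 1):
        for j in range(1, 2 ** b):
            q = j / 2 ** b
            total = b - bernoulli_log2_likelihood(q, heads, tails)
            if best is None or total < best[0]:
                best = (total, b, q)
    return best


balanced = best_model(50, 50)  # a coarse 1-bit parameter is enough here
skewed = best_model(90, 10)    # extra precision pays for itself here
```

For balanced data the cheapest description uses the coarsest parameter, while for skewed data the criterion accepts a more expensive model because the gain in likelihood exceeds the added cost; maximum likelihood alone would always prefer the finest grid.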

5  Matching  graphs 

A  key  mechanism  of  a  learning  system  based  on  a 
graph  representation  is  a  method  for  establishing  a 
mapping  between  elements  of  two  graphs,  or  graph 
matching.  Based  on  the  minimal  representation  cri¬ 
terion,  we  define  the  task  of  matching  a  probabilistic 
graph M to a graph H, as the construction of a mapping
between  vertices  of  M  and  H,  that  maximally  simpli¬ 
fies  description  of  H,  when  H  is  represented  relative 
to  M.  This  definition  provides  a  natural  decision  rule 
for  accepting  a  match,  which  does  not  rely  on  an  ar¬ 
bitrary  threshold.  Intuitively,  it  works  as  follows:  If 
part  of  M  fits  to  a  part  of  H,  then  representing  H  rel¬ 
ative  to  the  matching  part  of  M  should  cost  less  than 
its  default  representation,  independent  of  any  model. 
However,  the  total  cost  of  representing  H  with  the  aid 
of  M  includes  an  overhead  associated  with  specifying 
the  mapping.  If  the  part  of  H  represented  by  M  is  too 
small, the saving may not cover the overhead cost, and
it  will  be  cheaper  to  use  the  default  representation,  i.e., 
to  reject  the  match. 

A  default  representation  of  a  graph  is  constructed 
as  follows: 

1.  Specify  N,  the  number  of  leaves  in  H. 

2.  Provide  a  list  of  leaf  labels,  ordered  by  leaf  index. 

3. Order pairs of leaves (e.g. lexicographically), and
specify  the  label  of  the  common  parent  for  each 
pair  (NIL  if  none). 

4.  For  level  1,  order  pairs  of  non-null  vertices.  For 
every  pair,  specify  the  label  of  the  common  parent 
(or  NIL). 

5.  Repeat  the  last  step  for  each  higher  level,  up  to  a 
level  with  all  null  vertices. 

The cost of this representation $C(H)$ is

$$C(H) = C(N) - \sum_{v \in V(H)} \log_2 p(L(v)) \qquad (2)$$

where $p(l)$ is a prior probability of the label $l$. The first
term is the cost of representing $N$, the second term is
the cost of specifying vertex labels (this value is within
1 bit of the length of Shannon block encoding).

A graph $H$ can be described relative to a model
$M$, given a one-to-one mapping $T$ from $V_1 \subseteq V(M)$
onto $V_2 \subseteq V(H)$. To represent $H$ under this mapping
we use the probability defined by $M$ for labels of the
mapped vertices in $V_2$, instead of their prior probability.
The cost of representing $H$, given $M$ and $T$,
denoted $C(H|T,M)$, is:

$$C(H|T,M) = C(H) - \sum_{v \in V_1} \log_2 \frac{p(L(Tv)|v)}{p(L(Tv))} \qquad (3)$$

where the second term is the sum of bits saved over $V_2$.
The cost of describing $H$ relative to $M$, $C(H,T|M)$,
includes the overhead for specifying $T$:

$$C(H,T|M) = C(H|T,M) + C(T) \qquad (4)$$

where $C(T)$ is the cost of $T$. To find this cost, notice
that the mapping $T$ is completely determined by its
submapping $T'$ restricted to the sets of leaves $F(M)$
and $F(H)$. $T'$ is a mapping from $F_1 \subseteq F(M)$ onto
$F_2 \subseteq F(H)$, such that $T'(v) = T(v)$ if $v$ is a leaf.
The fact that $T'$ determines $T$ is implied by
property 3 of the layered graph (Section 2). Therefore,
we only need to specify the leaf mapping $T'$, and
$C(T) = C(T')$. If $T'$ maps $k$ out of $n$ leaves of one
graph to $k$ out of $m$ leaves of the other, then knowing
$n$ and $m$, we can encode $T'$ with

$$C(T') = \log_2 \min(n,m) + \log_2 \binom{n}{k} + \log_2 \frac{m!}{(m-k)!} \qquad (5)$$

bits. The first term is the cost of specifying $k > 0$,
the second term is the cost of selecting $k$ leaves of the
first graph, and the third is the cost of selecting $k$
leaves of the second graph and specifying their permutation.
This representation essentially assigns equal
prior probability to each value $k > 0$. An alternative
is to assume equal prior probability for each mapping,
which corresponds to a constant cost

We  prefer  the  first  representation,  since  it  is  less  pe¬ 
nalizing  to  small  values  of  k.  The  number  of  bits  saved 
with  respect  to  the  default  representation  is 


$$Q(H,T|M) = C(H) - C(H,T|M) = \sum_{v \in V_1} \log_2 \frac{p(L(Tv)|v)}{p(L(Tv))} - C(T) \qquad (7)$$


If  this  value  is  positive,  the  representation  based  on  M 
and  T  should  be  used  instead  of  a  default  representa¬ 
tion,  since  it  is  less  expensive. 
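
The acceptance rule can be sketched numerically as follows. The function and parameter names are hypothetical, the graphs are represented by plain dictionaries, and the mapping cost follows the three-term encoding described above (number of mapped leaves, choice of leaves in the first graph, ordered choice of leaves in the second).

```python
import math
from typing import Dict, Hashable, Set


def mapping_cost(n: int, m: int, k: int) -> float:
    """Bits needed to encode a leaf mapping of k out of n leaves to k out of m."""
    if not 0 < k <= min(n, m):
        raise ValueError("k must satisfy 0 < k <= min(n, m)")
    return (math.log2(min(n, m))        # which k
            + math.log2(math.comb(n, k))  # which k leaves of the first graph
            + math.log2(math.perm(m, k)))  # ordered choice in the second graph


def bits_saved(model_dists: Dict[Hashable, Dict[str, float]],
               priors: Dict[str, float],
               graph_labels: Dict[Hashable, str],
               mapping: Dict[Hashable, Hashable],
               model_leaves: Set[Hashable],
               n: int, m: int) -> float:
    """Q(H, T | M): label bits saved by the model minus the mapping overhead.

    n is the number of leaves of the model and m that of the graph; mapping
    sends model vertices to graph vertices and must map at least one leaf.
    """
    gain = 0.0
    for v, tv in mapping.items():
        label = graph_labels[tv]
        p = model_dists[v].get(label, 0.0)
        if p == 0.0:
            return float("-inf")  # the model forbids this label
        gain += math.log2(p / priors[label])
    k = sum(1 for v in mapping if v in model_leaves)
    return gain - mapping_cost(n, m, k)
```

A match is accepted in this sketch exactly when `bits_saved` returns a positive number, so no separate acceptance threshold has to be supplied.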

We define the task of matching a graph $H$ with a
model $M$ as the problem of constructing a mapping $T$
that maximizes the value of $Q(H,T|M)$. It is easy to
show that the problem of finding a maximal isomorphic
subgraph, which is NP-complete, is a special case of
the above problem, so the maximization of $Q(H,T|M)$
is NP-hard. Therefore, we must compromise and accept
a quasi-optimal solution obtained by heuristic
search. The fast graph matching heuristic described
below is an improved version of the method from Segen
(1987).

The  procedure  for  matching  H  with  M  consists  of 
two  steps,  called  map  and  refine.  Map  finds  an  ini¬ 
tial  mapping  T  that  maximizes  an  upper  bound  of  Q, 
then  refine  iteratively  edits  T  until  Q  reaches  a  local 
maximum. 

The function map uses contextual similarity as a basis
for the leaf assignment. We can think of a vertex $v$
of a layered graph as a relation $l(f_1, f_2, \ldots, f_k)$, where $l$
is the label of $v$, and $f_1, f_2, \ldots$ are recursively ordered
leaf descendants of $v$. Similarly, we can associate with
a vertex $v$ of a probabilistic layered graph a relation
$l(f_1, f_2, \ldots, f_k)$, with a probability $p(l|v)$. We consider
the context of a leaf $v$ to be the set of relations associated
with its ancestors. It is described by a support
of the leaf $v$, denoted $SPS(v)$. A support of a leaf $v$
of a layered graph is a set of pairs $\{(R, K)\}$, where $R$
describes a relation by its label and the position of the
vertex $v$ in its argument list, and $K$ is the number of
occurrences of $R$ among ancestors of $v$. A support of a
leaf $v$ of a probabilistic layered graph is a set of pairs
$\{(R, G)\}$, where $R$ describes a relation, as above, and
$G$ is a list containing a value log  for
each occurrence of $R$ in an ancestor $u$ of $v$, for which
 > 1. This list is sorted in decreasing order.
The factor  divides equally the support of $u$
among its leaf descendants.


A similarity $S(u, v)$ between a leaf $u$ of a graph and
a leaf $v$ of a probabilistic graph is defined as

$$S(u,v) = \sum_{\substack{(R,K) \in SPS(u) \\ (R,G) \in SPS(v)}} \; \sum_{i=0}^{\min(K,|G|)} G_i \qquad (8)$$
$S(u, v)$ is an upper bound on an increase of $Q$ resulting
from adding an assignment $(u, v)$ to $T$.

Map finds a mapping that maximizes the sum of similarities between mapped leaves, by solving an assignment problem. This mapping is iteratively improved by refine, which deletes or adds one assignment in each iteration, seeking a maximal increase of Q. The iteration stops when no single addition or deletion can further improve Q.
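The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the similarity table `sim`, the quality function `q`, and the names `map_leaves` and `refine` are hypothetical, a mapping is a set of (graph leaf, model leaf) index pairs, and the assignment problem is solved by exhaustive search, which is practical only for very small inputs.

```python
from itertools import permutations

def map_leaves(sim):
    """Initial mapping: give each row (graph leaf) a distinct column (model
    leaf) so that the summed similarity is maximal. Exhaustive search stands
    in for a real assignment-problem solver (an assumption of this sketch)."""
    n_rows, n_cols = len(sim), len(sim[0])
    if n_rows > n_cols:
        raise ValueError("sketch assumes no more rows than columns")
    best_cols, best_total = None, float("-inf")
    for cols in permutations(range(n_cols), n_rows):
        total = sum(sim[r][c] for r, c in enumerate(cols))
        if total > best_total:
            best_cols, best_total = cols, total
    return set(enumerate(best_cols))

def refine(mapping, candidates, q):
    """Hill climbing: in each iteration apply the single deletion or addition
    that increases q the most; stop when no single edit improves q."""
    current = set(mapping)
    while True:
        base = q(current)
        used_rows = {r for r, _ in current}
        used_cols = {c for _, c in current}
        # All mappings reachable by deleting one pair or adding one free pair.
        neighbours = [current - {p} for p in current]
        neighbours += [current | {(r, c)} for r, c in candidates
                       if r not in used_rows and c not in used_cols]
        best = max(neighbours, key=q, default=None)
        if best is None or q(best) <= base:
            return current
        current = best

# Toy data (hypothetical): 2 graph leaves, 3 model leaves.
sim = [[3, 1, 0], [2, 4, 1]]
payoff = {(0, 0): 2.0, (1, 1): -1.0, (1, 2): 0.5}
toy_q = lambda m: sum(payoff.get(p, -1.0) for p in m)
all_pairs = [(r, c) for r in range(2) for c in range(3)]
initial = map_leaves(sim)
final = refine(initial, all_pairs, toy_q)
```

Because `refine` only accepts strict improvements of `q`, it terminates at a local maximum, which mirrors the stopping rule stated in the text. A larger problem would need a polynomial-time assignment solver in place of the exhaustive search.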

6  Recognition  and  interpretation 

Given a library of probabilistic graph models $ML = \{M_0, M_1, M_2, \ldots, M_k\}$, and a graph H, we might want to select one model M, which is in some way nearest to H. We call this task recognition, or classification, since each model from ML is associated with a class of graphs for which it is nearest. Since the mapping of H to a single model may not exhaust all the vertices of H, we may want to map parts of H to several models. These mappings will form an interpretation of H. Interpretation can be considered an attempt to explain H using the models from the library as primitives. Since both recognition and interpretation can result in a compressed representation of H, we define both tasks as problems of minimizing representation cost.
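As a rough illustration of the two tasks, the sketch below treats recognition as a search over the library for the model with the best score, and interpretation as repeated recognition on whatever part of the graph is still unexplained. Everything concrete here is an assumption made for the example: a graph is reduced to a set of vertex identifiers, `match` is a hypothetical matcher that returns the explained vertices and a quality value, the score is that quality plus the base-2 logarithm of a prior (in the spirit of expression (9)), and the empty model is handled separately under the name `"M0"`.

```python
import math

def recognize(graph, library, match, prior):
    """Brute-force recognition: try every model and keep the best score.
    The empty model "M0" has a null mapping and quality 0 by convention."""
    best = ("M0", frozenset(), math.log2(prior["M0"]))
    for name, model in library.items():
        mapping, quality = match(graph, model)
        score = quality + math.log2(prior[name])
        if score > best[2]:
            best = (name, frozenset(mapping), score)
    return best

def interpret(graph, library, match, prior):
    """Greedy interpretation: recognize, remove the explained vertices,
    and repeat until the empty model wins."""
    remaining = set(graph)
    parts = []
    while True:
        name, mapping, _ = recognize(remaining, library, match, prior)
        if name == "M0" or not mapping:
            return parts, remaining
        parts.append((name, set(mapping)))
        remaining -= mapping

# Toy matcher (hypothetical): a model explains the vertices it shares
# with the graph, and each explained vertex is worth 3 bits.
def toy_match(graph, model):
    shared = set(graph) & model
    return shared, 3.0 * len(shared)

library = {"A": {1, 2, 3}, "B": {4, 5}}
prior = {"M0": 0.5, "A": 0.25, "B": 0.25}
parts, unexplained = interpret({1, 2, 3, 4, 5, 6}, library, toy_match, prior)
```

The cost of `recognize` grows linearly with the number of models, since every model is matched once per call.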


6.1  Recognition

Recognition is a problem of finding a model m in ML, and a mapping T from m to H, that maximizes

$$Q(H, T \mid m) + \log_2 P(m) \qquad (9)$$

Here P(m) is the prior probability of model m, and we assume $M_0$ to be an empty graph always associated with a null mapping, so that $Q(H, T \mid M_0) = 0$. A simple brute force recognition method seeks a maximum of this expression by matching H with each model in ML. The computational cost of this method grows approximately linearly with the size of the model library. We are presently experimenting with screening methods based on bounds provided by the similarity function (Equation 8) to speed up the recognition for large libraries.

6.2  Interpretation

Interpretation is a problem of constructing a sequence of models from ML, $m_1, m_2, \ldots, m_k$, where $m_k = M_0$ and $m_i \neq M_0$ for $i < k$, and a sequence of associated mappings $T_1, T_2, \ldots, T_k$ which maximize the value of

$$\sum_{i=1}^{k} \left[ Q(H_i, T_i \mid m_i) + \log_2 P(m_i) \right] \qquad (10)$$

The graphs $H_i$, $i = 1, 2, \ldots, k$ in the above expression are constructed recursively, by taking $H_1 = H$, and forming $H_{n+1}$ from $H_n$ by removing all vertices of $H_n$ mapped to $m_n$ by $T_n$. A heuristic method for graph interpretation solves the recognition problem, first for $H_1 = H$, then for $H_2$, etc., until some graph $H_k$ is recognized as $M_0$. Since interpretation requires multiple recognitions, it will also benefit from the use of screening in the recognition search.

7  Learning

Models used for recognition and interpretation can be learned from a training set of graphs. If the graphs in the training set are grouped into classes, and our objective is to find models that correspond to these classes, we have a task of supervised learning. We can approach it in two ways. First, we can force a single model to represent all graphs within a class. However, such a model may not perform well if the graphs in a class are not sufficiently similar. In such a case, we can try to divide a class into subclasses containing mutually similar graphs, and use a separate model for each subclass. This is an unsupervised learning task.

Below we describe a method of forced learning of a single class model (Segen, 1988a), and new clustering methods for unsupervised learning.

7.1  Forced learning


This learning procedure incrementally constructs a probabilistic graph for a class, from the sequence of its members. Given a sequence of graphs $H_1, H_2, \ldots, H_k$, the first graph $H_1$ is converted to a probabilistic graph $M_1$, by assigning a probability value to each vertex label, using the formula below. The following graphs are then used to update the model, which results in a sequence of models $M_1, M_2, \ldots, M_k$, using the following match-and-merge operation.


A graph $H_{n+1}$ is matched with model $M_n$, giving a mapping T. Since T maps a subset of leaves of $M_n$ to a subset of leaves of $H_{n+1}$, generally, some leaves of $M_n$ and $H_{n+1}$ remain unmapped. The mapping T and the graph $M_n$ are then extended to T' and M' in the following way: Initially M' is set to $M_n$. For each unmapped leaf l of $H_{n+1}$, a new leaf l' is added to M', and a pair (l', l) is added to T. Then new higher level vertices are added to M', such that each vertex of M' has a corresponding vertex in $H_{n+1}$. The mapping of these vertices is determined by recursively following child to parent links. Finally, vertex-label statistics for the vertices of M' are updated, based on the labels of their matches in $H_{n+1}$, and new label probabilities are computed using the Bayes estimator

$$P(i) = \frac{n(i) + 1}{n + k},$$

where n(i) is the number of events in the i-th category, n is the total number of events, and k is the number of categories. The final form of graph M' is then taken as $M_{n+1}$. The last graph $M_k$ obtained this way is used as the model of a class.
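The label-statistics update at the end of match-and-merge can be illustrated with a small sketch. It assumes that the estimator has the Laplace form (n(i) + 1) / (n + k), which agrees with the symbols n(i), n and k defined above; the class name `LabelStats` and its methods are hypothetical and cover only the statistics of a single model vertex, not the graph merging itself.

```python
from collections import Counter

class LabelStats:
    """Label counts for one model vertex (hypothetical helper)."""

    def __init__(self, categories):
        self.categories = tuple(categories)  # the k possible labels
        self.counts = Counter()              # n(i) for each observed label

    def update(self, label):
        """Record the label of the matched vertex in a new training graph."""
        self.counts[label] += 1

    def prob(self, label):
        """Assumed estimator: (n(i) + 1) / (n + k)."""
        n = sum(self.counts.values())
        k = len(self.categories)
        return (self.counts[label] + 1) / (n + k)

# A vertex matched in three training graphs, with labels a, a, b.
stats = LabelStats("abc")
for observed in "aab":
    stats.update(observed)
```

With this form, a label that has never been observed still receives a nonzero probability, so a later graph carrying that label does not get an infinite cost.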
7.2  Graph clustering

While most of the work on conceptual clustering is focused on attribute-value, or vector representations (Cheeseman et al., 1988; Fisher, 1987a, 1987b; Fisher & Langley, 1985; Michalski & Stepp, 1983; Segen & Sanderson, 1979; Segen, 1980, 1988b, 1989; Wallace & Boulton, 1968), several methods have been proposed for clustering graphs (Levinson, 1984; Wong & You, 1985) or structured entities that can be represented by graphs (Stepp & Michalski, 1986; Lebowitz, 1987; Thompson & Langley, 1989; Wogulis & Langley, 1989). These methods share the requirement for user-specified parameters or other subjective devices to control the number of clusters, or cluster separation. Here we attempt to formulate the clustering of graphs in a non-ad-hoc manner, as an optimization problem void of any free parameters.

Graph clustering is a task of dividing a set of graphs into groups, such that similar graphs are grouped together. We can formally define such a task as a problem of minimizing the cost of representation of a set of graphs. Given a sequence of graphs $I = H_1, H_2, \ldots, H_N$, we want to construct a set of probabilistic models, $ML = M_1, M_2, \ldots, M_K$, such that when graphs in I are represented relative to models in ML, the total cost of representing ML and I is minimum. The set of all graphs that are represented relative to the same model is called a cluster. To avoid the need for encoding label distributions for probabilistic graphs we will represent a model predictively, using a small set of cluster members. They will be called internal members, while remaining members of the cluster will be called external. This approach is related to Rissanen’s predictive minimal description length principle (Rissanen, 1987), but it does not depend on ordering of data elements.

A model $M_i$, i > 0, will be represented by a generating sequence of $n_i$ graphs from I, $I_i = H_{i1}, H_{i2}, \ldots, H_{in_i}$ (internal members). The forced learning algorithm applied to this sequence results in a sequence of models $M_{i1}, M_{i2}, \ldots, M_{in_i}$. The last model in this sequence is then used as the model $M_i$ of a cluster i. We use a default representation for the first graph in the generating sequence, and then encode the following graphs predictively, i.e., a graph $H_{i,j+1}$ is represented relative to a model $M_{ij}$. In addition, for each graph in $I_i$ we must encode its position in I, to preserve the initial ordering of I.

The total representation for ML and I consists of the following parts:

1. The length N of I, and the number of clusters K.

2. The predictive encoding of a generating sequence for each cluster.

3. The position of each of $n = \sum_i n_i$ internal members in I.

4. For each external member H of some cluster, a cluster index i, and the representation of H relative to $M_i$.

5. For each free (not a member of any cluster) graph H, the index of the group of free graphs, and the default representation of H.

An incremental clustering method called INCLAG begins by forming a cluster from the first element of I, then it assigns each successive element to its nearest cluster, or creates a new cluster containing this element, based on the value of expression (9). The nearest cluster is the one which gives maximal value to this expression; a new cluster is formed if this value is not positive. After examining the last element of I, all singleton clusters are eliminated, and their members become free graphs.
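The control flow of this incremental procedure can be sketched as follows. This is a minimal illustration, not the INCLAG implementation: the function name `inclag` and the argument `score` are hypothetical, `score(item, cluster)` stands in for the value of expression (9) for the cluster's model, and the toy score at the bottom works on numbers rather than graphs.

```python
def inclag(items, score):
    """Assign each item to the cluster with the highest score, or start a
    new cluster when even the best score is not positive. Singleton
    clusters are dissolved at the end and their members become free."""
    clusters = []
    for item in items:
        best = max(clusters, key=lambda c: score(item, c), default=None)
        if best is not None and score(item, best) > 0:
            best.append(item)
        else:
            clusters.append([item])
    kept = [c for c in clusters if len(c) > 1]
    free = [c[0] for c in clusters if len(c) == 1]
    return kept, free

# Toy score (hypothetical): positive when the item lies within distance 1
# of the cluster mean.
def toy_score(x, cluster):
    return 1.0 - abs(x - sum(cluster) / len(cluster))

kept, free = inclag([0.0, 0.2, 5.0, 0.1, 9.0, 5.3], toy_score)
```

The first item always starts a cluster, because no cluster exists yet to score it against. Since every item is placed as soon as it is seen, presenting the same items in a different order can produce different clusters.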

While INCLAG is simple and fast, its results depend on the order of data, and it tends to find only well separated clusters. A more complicated graph clustering method called ACLAG uses an agglomerative procedure, which does not depend on the order of data. Compared to INCLAG, it can better separate similar clusters, but it is also more expensive. Beginning with a default representation for each graph in I, and 0 clusters, ACLAG repeatedly applies one of the following moves, until there is a single cluster containing all elements of I:

1.  Form  a  new  cluster  from  a  pair  of  free  graphs. 

2.  Assign  a  free  graph  to  one  of  the  clusters,  as  an 
external  member. 

3. Merge two clusters by assigning members of the first cluster to the second cluster as external members, and removing the first cluster.

Each iteration selects the move which results in a minimal representation cost after the move, even if this cost increases. When a new member is added to a cluster, ACLAG attempts to reduce its cost by appending external members to the generating sequence. The final result is the best among the examined sets of clusters.
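The agglomerative search can be sketched as below. This is a simplified illustration under several assumptions: the names `moves` and `aclag` and the cost function are hypothetical, the distinction between internal and external members is ignored (so a merge is symmetric and the generating-sequence adjustment is omitted), and the toy cost operates on numbers rather than graphs.

```python
from itertools import combinations

def moves(clusters, free):
    """Yield every (clusters, free) state reachable by one of the three moves."""
    # 1. Form a new cluster from a pair of free items.
    for i, j in combinations(range(len(free)), 2):
        rest = [g for n, g in enumerate(free) if n not in (i, j)]
        yield clusters + [[free[i], free[j]]], rest
    # 2. Assign a free item to one of the clusters.
    for i, g in enumerate(free):
        rest = free[:i] + free[i + 1:]
        for c in range(len(clusters)):
            yield clusters[:c] + [clusters[c] + [g]] + clusters[c + 1:], rest
    # 3. Merge two clusters.
    for a, b in combinations(range(len(clusters)), 2):
        others = [c for n, c in enumerate(clusters) if n not in (a, b)]
        yield others + [clusters[b] + clusters[a]], free

def aclag(items, cost):
    """Always take the cheapest move, even if the cost goes up, and return
    the cheapest state seen on the way to a single cluster."""
    clusters, free = [], list(items)
    best = (cost(clusters, free), clusters, free)
    while True:
        options = list(moves(clusters, free))
        if not options:          # one cluster holds everything
            return best
        clusters, free = min(options, key=lambda s: cost(*s))
        current = cost(clusters, free)
        if current < best[0]:
            best = (current, clusters, free)

# Toy cost (hypothetical): 5 per free item, and for each cluster a fixed
# overhead of 4 plus its spread.
def toy_cost(clusters, free):
    return 5 * len(free) + sum(4 + max(c) - min(c) for c in clusters)

best_cost, best_clusters, best_free = aclag([0, 1, 10, 11], toy_cost)
```

Every move lowers the number of free items plus the number of clusters by one, so the loop ends after a bounded number of iterations. Recording the best state along the way is what allows the procedure to pass through worse configurations without losing a good one.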

8  Application  example:  Learning 
shape  models 

The  methods  presented  in  this  paper  have  been  applied 
to  modeling,  recognition  and  interpretation  of  planar 
shapes.  Details  of  this  application  will  be  described 
in  a  separate  paper,  and  here  we  show  only  several 
examples  to  illustrate  the  methodology,  and  give  a  feel 
of  its  practical  potential. 

A general task is to learn models of shape classes from a training set, consisting of shape examples and their class names, and to use these models to explain new simple or composite shapes. To represent shape in a layered graph structure we need to define primitive parts that will form leaves of the graph, and relations, to be represented by higher level vertices. In our representation (Segen, 1988a), the primitives, represented by leaves, are points of extremal curvature of the contour, along with their local tangent and curvature. Higher level vertices represent geometric relations among the primitives, based on distances and angles: first level vertices correspond to binary relations, second level vertices to fourth order relations, etc. Figure 2 shows a shape with labeled points of extremal curvature (leaves), and labeled binary relations, i.e. the first two layers of a graph. A complete three layer graph representing this shape (without vertex labels) is shown in Figure 3.

In one application we used 5 classes of such shapes. Figure 4 shows one example of each of these classes. The training set consisted of 270 examples (54 per class), and the test set of 146 shapes. When we applied the forced learning method and used the resulting models to recognize test examples, 105 examples (about 72%) were recognized correctly. After grouping the training examples by incremental clustering (INCLAG), the number of correctly recognized cases increased to 127 (about 87%).

To see how automatic clustering relates to our subjective classification, we applied the agglomerative clustering method (ACLAG) to a mixed collection of training examples. The result in Figure 5 shows that classes are subdivided with only a few misplacements.

9  Conclusion 

This paper applied the criterion of minimal representation to formulate graph matching, classification, interpretation, and learning probabilistic graph models as combinatorial optimization problems. The methods proposed to approximately solve these problems rely on an efficient graph matching heuristic. These methods are quite practical, and have been applied to recognition and interpretation of nonrigid shapes using real, noisy data. Possible extensions of this work may include other types of graphs and hypergraphs, in particular undirected graphs. However, more immediately needed is an analysis of the accuracy and speed of the presented heuristics.

Abstract

In spite of the importance of representation in learning, little progress has been made toward understanding what makes representations work. This paper describes a framework for knowledge-level analysis of changes in the representation of training examples in concept learning. This is a very fundamental sort of representation change; such a change alters the very space over which learning occurs, and hence necessitates selection of a new hypothesis space and (probably) a new learning algorithm. The goals of this paper are first, to provide a framework for analysis of representation shifts; second, to make explicit the assumptions implicit in representation shifts that have actually been used in learning systems; and third, to suggest a procedure for finding the most appropriate representation shift, given some background knowledge about a learning problem. The analytic framework is used to analyze a class of hybrid EBL/SBL systems by characterizing the sorts of domain theories that can be used with these systems.


1  Introduction 

Few will take issue with the claim that the choice of an appropriate representation is crucial to success in solving AI problems. This is particularly true for learning problems. However, in spite of the importance of representation in learning, little progress has been made towards understanding what makes representations work: determining whether a representation will or will not help in solving a learning problem is still by and large a black art. In particular, there is no accepted procedure for answering questions about representation change such as: when is a representation appropriate for a learning problem? what assumptions are being implicitly made when a particular representation is being used? given some background knowledge about a learning problem, which representations are good, which are bad, and which are optimal?

The subject of this paper is representation change in learning; in particular, changes in the representation of training examples in concept learning problems. This is in contrast to previous work in representation change in learning, which has primarily addressed the problem of changing the representation of hypotheses (i.e., shift of inductive bias). Since a hypothesis is a set of examples, a change in the representation of training examples requires change in inductive bias; however, being able to change the language used to describe examples as well as the language used to describe hypotheses makes possible more radical representation shifts.

The problem of automatically performing representation shifts of this kind is outside the scope of this paper. The goals of this paper are:

1. to provide a framework for analysis of representational choices made by humans,

2. to uncover the assumptions implicit in certain representation shifts that are used in real learning systems, and

3. to suggest a methodology for selection of an appropriate representation shift, given some background knowledge about the learning problem.

All analysis is done at the “knowledge level” [Newell, 1982]; consideration is given only to when learnability is made possible or impossible, not to when it is made easy or difficult. This enables our results to be independent of particular learning algorithms and particular learnability criteria; however, it also restricts the analysis to representation shifts that are not information-preserving.

Proofs  of  theorems  are  included  in  an  appendix. 

2  Motivation  and  Overview 

Consider a learning program with the architecture shown in figure 1. First, examples are mapped from an initial space $X_I$ into the representation space $X_R$; then, a learning algorithm is run in the representation space; finally, the concept learned in the representation space is translated back to the initial space. Such a learning system is shifting the representation of examples from $X_I$ to $X_R$ using the function $f_R$.

Figure 1: A learning system using representation shift

Several learning systems of this sort have been built [Cohen, 1988; Cohen, 1990b; Flann and Dietterich, 1989; Hirsh, 1989]. One advantage of this architecture is that it allows background knowledge to be applied to a learning problem in a highly modular way. Background knowledge is used only to select the space over which learning will occur, via the representation-shifting function $f_R$; standard concept learning techniques can then be applied in the new space $X_R$.

A small notational issue should be clarified at this point. In this paper, we will adopt the convention that if f is a function with domain X and if $S \subseteq X$, then f(S) denotes the set $\{f(x) : x \in S\}$. This will greatly simplify the notation in situations where there are two parallel functions that must be considered, one which maps instances to instances, and one which maps concepts to concepts. An example of this appears in Figure 1, which actually contains two representation-shifting functions: $f_R$, which maps examples from the initial space to the representation space, and an “inverse” of $f_R$, denoted $f_R^{-1}$ in the figure, which maps concepts in the representation space back to concepts in the initial space. The function $f_R^{-1}$ is of course not a true inverse of $f_R$; rather, it is the inverse of $f_R$ extended to concepts using the convention described above.
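For finite spaces, the set convention and the three-stage architecture can be written down directly. The sketch below is an illustration only: the function names are hypothetical, concepts are Python sets, the inverse applied to a concept is read as the set of initial-space examples whose image falls inside that concept, and the "learner" is a trivial stand-in that memorizes the shifted positive examples.

```python
def image(f, subset):
    """f(S) = {f(x) : x in S}, the convention adopted in the text."""
    return {f(x) for x in subset}

def inverse_image(f, domain, concept_r):
    """All x in the initial space whose representation lies in concept_r."""
    return {x for x in domain if f(x) in concept_r}

def learn_with_shift(domain, positives, f_r, learner):
    """Shift the examples, learn in the representation space, translate back."""
    shifted = image(f_r, positives)
    concept_r = learner(shifted)
    return inverse_image(f_r, domain, concept_r)

# Toy setting (hypothetical): integers 0..9 represented by their parity.
domain = range(10)
parity = lambda x: x % 2
rote_learner = lambda examples: set(examples)
learned = learn_with_shift(domain, {2, 4}, parity, rote_learner)
```

Even with a learner that does no generalization of its own, the concept returned in the initial space covers every even number. The generalization comes entirely from the representation shift, which is one way to see how such a shift carries background knowledge.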

The effectiveness of such a learning system depends on the representation shift. The first issue addressed in this paper is: when is a representation shift appropriate for a learning problem? In short, what are the principles that underlie representation shift? We establish a correspondence called knowledge-level equivalence between a representation and the assumptions about the learning problem that are implicitly made when that representation is selected. This correspondence is the basis of our analytic framework.

The  correspondence  can  be  used  in  three  ways. 

• Given a representation shift, it is possible to identify the background knowledge to which the representation shift corresponds. In other words, it is possible to uncover the assumptions implicit in the use of a specific representation shift.

• Conversely, given some background knowledge about a learning problem, it is possible to test whether a given representation shift is necessarily appropriate, by testing whether the knowledge which corresponds to the representation shift is implied by the background knowledge.

• Finally, if one assumes that it is desirable to use a representation shift which corresponds to the strongest possible knowledge, it is possible to determine if a representation shift is optimal with respect to some given background knowledge about a learning system. This suggests a procedure for constructing a representation shift given some background knowledge about a learning problem. Such a representation shift is a means of incorporating background knowledge into a concept learning problem.

Two applications are given of our framework. In section 4.2, we find the assumptions implicit in the use of a common representational shift: a shift from examples to the “explanation structures” generated by the explanation based generalization (EBG) algorithm [Mitchell et al., 1986]. This shift is found to correspond to an assumption about the correctness of EBG. In section 4.3, we relax this assumption of correctness, and construct a new representation shift which is knowledge-level equivalent to the relaxed assumption. This representation shift could be used as the basis of a learning system that makes a weaker assumption about the correctness of EBG.

3  A  framework  for  analyzing 
representation  shift 

The question addressed in this section is: when is a representation shift appropriate to a particular class of learning problems? The meaning of this question is intuitively clear. However, in order to formulate and answer the question above precisely, it is necessary to formally define the terms involved: representation shift, learning problem, and appropriate.

3.1  What  is  a  representation  shift? 

A representation shift will be formalized as a function $f_R$ from an original initial space $X_I$ into another space $X_R$ called the representation space.

Representation shift = a function $f_R : X_I \rightarrow X_R$

In a typical case, $X_I$ is a set of potential training examples in some initial description language, which is inappropriate for learning. $X_R$ contains representations of the elements of $X_I$, in what is hopefully a better format for learning. For example $X_I$ might be a set of digitized photographs of soybean plants, and $X_R$ might consist of feature-vector representations of these plants using certain features known to be significant for the learning task at hand, as in [Michalski and Chilausky, 1980]; or $X_I$ might be a set of chess positions, and $X_R$ might be proofs that white has a forced exchange from that position, as in [Flann and Dietterich, 1989].

3.2  What  is  a  learning  problem? 

A definition of a “class of learning problems” must now be given. It is assumed that the problem faced by the learner is to correctly identify some unknown target concept T given some “partial information” about T. Typical examples of the sorts of partial information available might be a set of answers to queries issued by the learner, or a set of labeled examples of members and nonmembers of T.

We would like to exclude from our formalization of a learning problem anything specific to a particular learning algorithm or a particular model of learnability. One way to do this is to assume that the partial information is requested by the learner (for example, via a set of calls to oracles) and is not an explicit input to the learner. (It seems reasonable to assume that for every learner that takes partial knowledge as an explicit input, there is a learner that requests this input from an oracle.) In this case, a learning problem is simply a possible target concept T, and a class of learning problems is simply a set of such concepts.

Class of learning problems = a set $\mathcal{P} = \{T_1, T_2, \ldots\}$

3.3  When  is  a  representation  appropriate? 

Given these definitions, it is now possible to state precisely when a representation is appropriate. A representation $f_R$ is appropriate for the class of learning problems $\mathcal{P}$ if and only if each learning problem is potentially solvable in the representation space. A learning problem T is potentially solvable if it is possible to exactly describe a target concept T by an image of T under $f_R$. If we let $APPROP(f_R, \mathcal{P})$ denote the proposition “$f_R$ is appropriate for $\mathcal{P}$”, and let $X_R$ be the range of $f_R$, then the definition of appropriateness can be stated in symbols:

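$$APPROP(f_R, \mathcal{P}) \equiv f_R \text{ is a well-defined function} \;\wedge\; \forall T \in \mathcal{P},\ \exists T_R \subseteq X_R : T = f_R^{-1}(T_R)$$

An alternative definition which some readers may find more intuitive is

$$APPROP(f_R, \mathcal{P}) \equiv f_R \text{ is a well-defined function} \;\wedge\; \forall T \in \mathcal{P},\ \forall x, y \in X_I,\ \bigl(f_R(x) = f_R(y) \Rightarrow \chi_T(x) = \chi_T(y)\bigr)$$

where $\chi_T(x)$ denotes the characteristic function of T.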
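When the initial space is finite, both forms of the definition can be checked mechanically. The sketch below is an illustration with hypothetical names: concepts are Python sets that are assumed to be subsets of `domain`, and the well-definedness clause is taken for granted because `f_r` is an ordinary function.

```python
def is_appropriate(f_r, domain, concepts):
    """Second form: examples with equal representations must agree on
    membership in every target concept."""
    for target in concepts:
        seen = {}
        for x in domain:
            member = x in target
            if seen.setdefault(f_r(x), member) != member:
                return False
    return True

def is_appropriate_by_images(f_r, domain, concepts):
    """First form: each target must be the inverse image of some set in the
    representation space. If any such set exists, the image of the target
    itself works, so only that candidate is tried."""
    for target in concepts:
        t_r = {f_r(x) for x in target}
        if {x for x in domain if f_r(x) in t_r} != set(target):
            return False
    return True

# Toy setting (hypothetical): integers 0..5 represented by x mod 3.
domain = range(6)
mod3 = lambda x: x % 3
good = [{0, 3}, {1, 2, 4, 5}]   # unions of residue classes
bad = [{0, 1, 2}]               # separates 0 from 3, which share an image
```

The concept {0, 1, 2} fails because 0 and 3 have the same representation but only 0 belongs to the concept, so no set in the representation space can describe it exactly.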

The definition of potentially solvable assumes that the goal of learning is exact identification of the target concept. This fits many formal definitions of learnability, such as Gold’s learnability in the limit [Gold, 1967], Littlestone’s mistake-bounded learnability [Littlestone, 1988], and Angluin’s definitions of learning from queries [Angluin, 1988]. An exception is the Valiant criterion of probably approximately correct (pac) learnability [Valiant, 1984], which requires only approximate identification of the target concept; this suggests that a probabilistic extension of this analytic framework may be worth investigating.

Notice that the appropriateness of $f_R$ does not say anything about how easy or difficult it is to solve the learning problem in the representation space $X_R$; it merely says that it is still possible to solve the learning problem.

3.4  Representations  and  background 
knowledge 

Let $A(\mathcal{P})$ denote a first order sentence which is a statement about $\mathcal{P}$; $A(\mathcal{P})$ will alternatively be viewed as an assumption about the class of learning problems, or as background knowledge about the class of learning problems. We can now formulate precisely and answer the questions given in the introduction:

Question 1: Given a representation $f_R$, what assumptions are implicitly made when $f_R$ is used on the class of learning problems $\mathcal{P}$?

The answer to this question is obvious: any learning system that uses a representation $f_R$ makes the assumption that $f_R$ is appropriate for the learning problem; i.e., that $APPROP(f_R, \mathcal{P})$ is true. A particular learner may (and probably will) make other assumptions as well in the process of generating a hypothesis; however, since every learning system that uses the representation $f_R$ assumes $APPROP(f_R, \mathcal{P})$, it seems reasonable to characterize $APPROP(f_R, \mathcal{P})$ as the assumptions made in choosing the representation $f_R$.

Question 2a: Given some background knowledge $A(\mathcal{P})$ about a class of learning problems $\mathcal{P}$, what representations $f_R$ can be used?

Question 2b: In the same circumstances, what representations $f_R$ should be used?

The answer to question 2a is also obvious: if

$$A(\mathcal{P}) \Rightarrow APPROP(f_R, \mathcal{P}) \qquad (1)$$

then the representation $f_R$ could be used for the learning problem $\mathcal{P}$, given the background knowledge $A(\mathcal{P})$. The answer to 2b is not obvious, however. There may be many representations that have this property; which of these should be used? In short, are there any grounds for preferring one of these representations over others?

One context in which $f_{R_1}$ is preferable to $f_{R_2}$ is when the appropriateness of $f_{R_1}$ implies the appropriateness of $f_{R_2}$, i.e.

$$APPROP(f_{R_1}, \mathcal{P}) \Rightarrow APPROP(f_{R_2}, \mathcal{P}) \qquad (2)$$

The rationale for preferring $f_{R_1}$ is the following: consider a learning system $L_1$ that uses $f_{R_1}$, and another learning system $L_2$ that uses $f_{R_2}$. $L_1$ “knows” only that $f_{R_1}$ is appropriate, and $L_2$ “knows” only that $f_{R_2}$ is appropriate. Then if equation 2 holds, everything known by $L_2$ is deductively implied by what is known by $L_1$, and can, at least in principle, be derived from knowledge available to $L_1$. At the knowledge level, $L_1$ is “better informed about the learning problem” than $L_2$.
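On finite spaces there is a simple sufficient condition for a preference of this kind, offered here as an added observation rather than as part of the original analysis. Suppose the first representation can be computed from the second, so that two examples merged by the second are always merged by the first. Then, by the alternative form of the definition of appropriateness, whenever the first representation is appropriate for a class of concepts the second is appropriate as well. The function name `factors_through` and the toy representations are hypothetical.

```python
def factors_through(f1, f2, domain):
    """True if f2(x) == f2(y) always implies f1(x) == f1(y) on `domain`,
    i.e. f1 can be computed from f2."""
    seen = {}
    for x in domain:
        if seen.setdefault(f2(x), f1(x)) != f1(x):
            return False
    return True

# Toy representations (hypothetical) of the integers 0..11.
domain = range(12)
coarse = lambda x: x % 2   # merges more examples, so assumes more
fine = lambda x: x % 4     # merges fewer examples, so assumes less
```

Here `coarse` plays the role of the preferred representation: if parity is appropriate for a class of concepts, then the residue modulo 4 is appropriate too, while the converse can fail.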

This rule of preference means that logical implication imposes a partial order on the desirability of representations. The best representations are those $f_R$ that satisfy equation 1 and are maximal with respect to this partial order. It is easy to show that if $f_R$ has the property

$$A(\mathcal{P}) \Leftrightarrow APPROP(f_R, \mathcal{P})$$

then it fulfills this requirement. Such a representation will be said to be knowledge-level equivalent (KL-equivalent) to the assumption $A(\mathcal{P})$. The answer to question 2b is thus: given $A(\mathcal{P})$, the most desirable representation is some $f_R$ such that $f_R$ is KL-equivalent to $A(\mathcal{P})$.

Notice that $f_R$ is not unique: that is, given an
assumption $A(\mathcal{P})$, there is more than one $f_R$ that is
KL-equivalent to $A(\mathcal{P})$. In particular, if $g$ is any
one-to-one function and $f_R$ is KL-equivalent to $A(\mathcal{P})$, then
$f_R \circ g$ (that is, $g$ applied to the output of $f_R$) is also
KL-equivalent to $A(\mathcal{P})$. This is to be
expected: if $g$ is one-to-one, then it is an
“information-preserving” operation and should make no difference
at the knowledge level. However, the following theorem
shows that this is the only sort of variation of
$f_R$ which is possible: i.e., that the KL-equivalent
representation for an assumption $A(\mathcal{P})$ is unique up to
composition with one-to-one functions.

Theorem 1 If $f_{R_1}$ and $f_{R_2}$ are both KL-equivalent to
$A(\mathcal{P})$, then there is a one-to-one function $g$ such that
$f_{R_1} = f_{R_2} \circ g$.
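
For finite instance spaces, appropriateness can be checked mechanically. The sketch below assumes the characterization of APPROP stated as equation 3 in the appendix (two instances with the same image must agree on membership in every target concept). The function name `is_appropriate` and the toy concepts are illustrative and do not come from the paper.

```python
from itertools import combinations

def is_appropriate(f, problems, instances):
    """True if f(x) == f(y) implies x and y agree on every target concept."""
    for target in problems:
        for x, y in combinations(instances, 2):
            if f(x) == f(y) and ((x in target) != (y in target)):
                return False
    return True

# Toy instance space: the integers 0..7; the class holds two target concepts.
instances = range(8)
problems = [{0, 2, 4, 6}, {1, 3, 5, 7}]    # "even" and "odd"

parity = lambda x: x % 2         # keeps exactly what the concepts depend on
coarse = lambda x: 0             # collapses every instance to one point
relabel = lambda r: ("tag", r)   # a one-to-one function on representations

print(is_appropriate(parity, problems, instances))                        # True
print(is_appropriate(coarse, problems, instances))                        # False
print(is_appropriate(lambda x: relabel(parity(x)), problems, instances))  # True
```

The last line illustrates the remark preceding Theorem 1: applying a one-to-one function after an appropriate shift leaves it appropriate.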

The  analytic  framework  presented  in  this  section  is 
summarized  in  table  1. 

4  Applications  of  the  framework 

The  true  value  of  an  analytic  framework  lies  in  its  use¬ 
fulness  in  the  analysis  of  meaningful  problems.  We  will 
consider  two  uses  of  the  framework.  First,  in  section 
4.2,  we  find  the  assumption  implicit  in  the  use  of  a 
common  representational  shift.  This  is  a  simple  anal¬ 
ysis;  it  merely  requires  finding  a  meaningful  condition 
which is logically equivalent to $\mathrm{APPROP}(f_R, \mathcal{P})$. In
section  4.3,  we  consider  the  related  problem  of  con¬ 
structing  a  representation  shift  which  is  optimal  given 
a  particular  assumption  about  a  learning  problem. 
Here,  given  an  assumption  about  the  learning  prob¬ 
lem,  the  problem  is  to  construct  a  representation  shift 
which  is  KL-equivalent  to  this  assumption. 


The  representation  shifts  and  assumptions  which 
we  consider  are  closely  connected  to  the  operation  of 
explanation  based  generalization  (EBG),  a  procedure 
that  uses  a  logical  theory  to  generalize  a  single  positive 
example  of  some  concept.  A  short  discussion  of  EBG 
precedes  our  analyses. 

4.1  Explanation  based  generalization 

To avoid unnecessary detail, explanation based generalization
will be discussed at a rather abstract
level in this paper. Readers requiring more detail
are referred to [Kedar-Cabelli and McCarty, 1987;
Mitchell et al., 1986]. Let $\Theta$ be a theory, and for
$x \in X_I$, let $\mathrm{PROOFS}_\Theta(x)$ denote the set of proofs
of $x$ in $\Theta$. This assumes that $x$ is a logical goal,
for example cup(obj1); such an example is implicitly
tagged with the “target concept” to which it is
relevant. Let $O$ denote an operationality predicate;
in this context, an operationality predicate tells if a
subproof is a “detail” not relevant to the concept to
be learned. Explanation based generalization is an
operation defined on a proof $p_x \in \mathrm{PROOFS}_\Theta(x)$
and denoted by the function $\mathrm{EBG}(\Theta, O, p_x)$. This
function always returns a set that generalizes $x$; i.e.,

$$\forall x \in X_I,\ \{x\} \subseteq \mathrm{EBG}(\Theta, O, p_x) \subseteq X_I.$$

Notice that the usual assumptions about $\Theta$ (e.g.,
that it is complete and correct, that it is defined only
for positive examples) have not been made. The
reason is that the assumptions which an EBL system
makes about $\Theta$ depend in part on how generalizations
are  used.  The  remainder  of  this  section  will  consider 
several  possible  schemes  for  using  EBG,  and  analyti¬ 
cally  determine  which  assumptions  about  the  domain 
theory  are  being  made. 

There are several known algorithms for explanation
based generalization. Generally, a proof $p_x$ is generalized
in two ways: operational subtrees are pruned
in accordance with the $O$ predicate, and some of the
constants appearing in the proof are replaced with
variables. This generalized proof is called an explanation
structure, and will be denoted $e_O(p_x)$. A rule
is then extracted from the explanation structure by
conjoining the leaves of the explanation structure to
form the antecedent, and using the root of the explanation
structure as the consequent. The set of goals
provable directly from this rule is the generalization of
$x$.

As  an  example  of  explanation  based  generaliza¬ 
tion,  [DeJong  and  Mooney,  1986]  describes  a  simple 
theory  of  social  interactions,  under  which  the  goal 
kill(john,john)  is  generalized  to  form  the  rule 

kill(X, X) ← depressed(X) ∧ buy(X, W) ∧ gun(W)

which can be paraphrased as saying “X will kill himself
if he is depressed and has bought a gun.” Here
the example is the goal kill(john,john) and the generalization
is the set of goals provable by the rule above,
i.e. the set of goals of the form kill(X,X) such that

depressed(X) ∧ buy(X, W) ∧ gun(W)

is  provable.  In  this  case,  the  generalization  corre¬ 
sponds  loosely  to  the  concept  of  “suicide”. 
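
The pruning and rule-extraction steps described above can be sketched on the suicide example. The proof tree below, including the intermediate goals `hate`, `possess` and `weapon`, is a hypothetical stand-in for the theory of [DeJong and Mooney, 1986]. Renaming every constant to a fresh variable is a simplifying assumption, since the text says only that some of the constants are replaced.

```python
from dataclasses import dataclass, field

@dataclass
class Proof:
    goal: tuple                                   # e.g. ("kill", "john", "john")
    children: list = field(default_factory=list)  # subproofs of the goal

def prune(proof, operational):
    """Cut the proof off below every goal the operationality predicate accepts."""
    if operational(proof.goal) or not proof.children:
        return Proof(proof.goal)
    return Proof(proof.goal, [prune(c, operational) for c in proof.children])

def leaves(proof):
    """Goals at the leaves of a proof, left to right."""
    if not proof.children:
        return [proof.goal]
    return [g for c in proof.children for g in leaves(c)]

def extract_rule(proof, operational):
    """Return (consequent, antecedent); constants are renamed consistently."""
    pruned = prune(proof, operational)
    names = {}
    def variablise(goal):
        pred, *args = goal
        return (pred, *(names.setdefault(a, f"V{len(names)}") for a in args))
    head = variablise(pruned.goal)
    body = [variablise(g) for g in leaves(pruned)]
    return head, body

# Hypothetical proof of kill(john, john); obj1 is an invented constant.
proof = Proof(("kill", "john", "john"), [
    Proof(("hate", "john", "john"), [Proof(("depressed", "john"))]),
    Proof(("possess", "john", "obj1"), [Proof(("buy", "john", "obj1"))]),
    Proof(("weapon", "obj1"), [Proof(("gun", "obj1"))]),
])
operational = lambda goal: goal[0] in {"depressed", "buy", "gun"}

head, body = extract_rule(proof, operational)
print(head)   # ('kill', 'V0', 'V0')
print(body)   # [('depressed', 'V0'), ('buy', 'V0', 'V1'), ('gun', 'V1')]
```

With V0 read as X and V1 as W, the extracted rule matches the one quoted above. A real EBG implementation decides which constants to generalize by examining the proof, which this naive renaming does not attempt.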
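
Table 1: Summary of the analytic framework

Intuitive Notion                              Formalization
--------------------------------------------  ---------------------------------------------
representation shift                          a function $f_R : X_I \to X_R$
learning problem                              a target concept $T \subseteq X_I$
class of learning problems                    a set $\mathcal{P} = \{T_1, T_2, \ldots\}$
background knowledge of learning problem      a first order formula $A(\mathcal{P})$
$f_R$ is appropriate for $\mathcal{P}$        $\mathrm{APPROP}(f_R, \mathcal{P})$
assumption implicitly made in using $f_R$     $\mathrm{APPROP}(f_R, \mathcal{P})$
$f_R$ can be used given $A(\mathcal{P})$      $A(\mathcal{P}) \Rightarrow \mathrm{APPROP}(f_R, \mathcal{P})$
$f_R$ should be used given $A(\mathcal{P})$   $A(\mathcal{P}) \Leftrightarrow \mathrm{APPROP}(f_R, \mathcal{P})$
                                              (i.e., $f_R$ is KL-equivalent to $A(\mathcal{P})$)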

Later analysis will not depend on the precise algorithm
used for EBG; however, the following property
is assumed to hold.¹

Assumption 1 For all $\Theta$, $O$, $p_x$, and $y$,

$$y \in \mathrm{EBG}(\Theta, O, p_x) \Leftrightarrow \exists p_y \in \mathrm{PROOFS}_\Theta(y) : e_O(p_y) = e_O(p_x)$$

4.2  Analysis  of  a  representation  shift 

An important topic in machine learning research is
integration of explanation based and similarity based
learning techniques. Several integrated learning systems
have been built that have the architecture of figure 1,
and that use as a representation-shifting function
some close variant of the function $f(x) = e_O(p_x)$.
These systems learn from a set of explanation structures
using SBL techniques; they differ primarily in the
learning methods used. (See [Flann and Dietterich,
1989] for a discussion of some of these systems.)

What assumptions are made by a learning program
that uses this representation shift? The answer
$\mathrm{APPROP}(f, \mathcal{P})$ is correct, but not very meaningful.
However, it can be shown that $\mathrm{APPROP}(f, \mathcal{P})$ is logically
equivalent to a meaningful assumption about the
behavior of EBG.

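Theorem 2 Let $f_{\Theta,O}$ be defined as $f_{\Theta,O}(x) = e_O(p_x)$.
Then

$$\mathrm{APPROP}(f_{\Theta,O}, \mathcal{P}) \equiv \big(\forall x \in X_I,\ |e_O(\mathrm{PROOFS}_\Theta(x))| = 1\big) \wedge \big(\forall T \in \mathcal{P},\ \forall x \in T,\ \forall p_x \in \mathrm{PROOFS}_\Theta(x),\ \mathrm{EBG}(\Theta, O, p_x) \subseteq T\big)$$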


¹This assumption should hold for any reasonable implementation
of EBG. The “only if” direction requires only
that the output of EBG be determined by $\Theta$, $O$, and $e_O(p_x)$:
in this case, if $\exists p_y \in \mathrm{PROOFS}_\Theta(y) : e_O(p_y) = e_O(p_x)$,
then $\mathrm{EBG}(\Theta, O, p_y) = \mathrm{EBG}(\Theta, O, p_x)$. We know that
$y \in \mathrm{EBG}(\Theta, O, p_y)$, and hence $y \in \mathrm{EBG}(\Theta, O, p_x)$ as well.
The “if” direction is also intuitively clear, if the conjunction
that describes the set $\mathrm{EBG}(\Theta, O, p_x)$ represents preconditions
under which the “abstract version” $e_O(p_x)$ of
the proof $p_x$ is applicable: if $y$ is in this set, those preconditions
are true, and some instantiation of the “abstract
proof” $e_O(p_x)$ should succeed for $y$ as well as for $x$.


This  restatement  of  APPROP  shows  that  it  is 
equivalent  to  two  assumptions  about  EBG.  The  first 
assumption  is  that  every  instance  has  at  most  a  single 
proof,  up  to  the  level  of  detail  specified  by  the  expla¬ 
nation  structure.  Since  this  constraint  is  usually  en¬ 
forced  by  ensuring  that  there  is  at  most  one  proof  for 
every  instance  in  the  domain  theory,  we  will  call  this 
assumption  the  unique  proof  assumption.  The  second 
assumption  is  about  the  “validity”  of  generalizations 
of  positive  examples.  Intuitively,  the  generalization  of 
any $x \in T$ is valid: i.e., it is a correct generalization
of $x$ with respect to $T$. Notice that this assumption is
much  weaker  than  the  assumption,  commonly  associ¬ 
ated  with  EBL  systems,  that  the  theory  is  a  complete 
and  correct  definition  of  the  target  concept  T. 
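
On a finite toy problem the two conditions of Theorem 2 can be checked directly and compared with appropriateness itself. In the sketch below the instances, proofs and explanation structures are invented for illustration, EBG is defined through Assumption 1 rather than by a real generalizer, and appropriateness is tested with the characterization given as equation 3 in the appendix.

```python
# Toy model: every name and value here is hypothetical, not from the paper.
PROOFS = {"a": ["p1"], "b": ["p2"], "c": ["p3"], "d": ["p4"]}   # instance -> proofs
STRUCTURE = {"p1": "e1", "p2": "e1", "p3": "e2", "p4": "e3"}    # proof -> e_O(proof)

def structures(x):
    """Explanation structures over all proofs of instance x."""
    return {STRUCTURE[p] for p in PROOFS[x]}

def ebg(p):
    """Generalisation of proof p, defined here through Assumption 1."""
    return {y for y in PROOFS if STRUCTURE[p] in structures(y)}

def unique_proof():
    """Every instance has exactly one explanation structure."""
    return all(len(structures(x)) == 1 for x in PROOFS)

def valid(problems):
    """Every generalisation of a positive example stays inside the concept."""
    return all(ebg(p) <= target
               for target in problems for x in target for p in PROOFS[x])

def appropriate(problems):
    """Equation 3 applied to f(x) = the explanation structure of x."""
    return all((x in t) == (y in t)
               for t in problems for x in PROOFS for y in PROOFS
               if structures(x) == structures(y))

good = [{"a", "b"}, {"a", "b", "c"}]
bad = [{"a", "c"}]    # b shares a's structure but lies outside the concept

print(unique_proof(), valid(good), appropriate(good))   # True True True
print(unique_proof(), valid(bad), appropriate(bad))     # True False False
```

Instance d is provable but belongs to no target concept, so the toy theory is overly general, yet both conditions still hold for the first class of concepts.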

Theorem 2 has the interesting corollary that $f_{\Theta,O}$ is
KL-equivalent to the assumption above (i.e., the conjunction
of the unique proof assumption and the validity
assumption). This means that if $L'$ is any hybrid
SBL/EBL concept learning system which makes these
assumptions, and which assumes nothing else about
the learning problem, then a learner that uses the
representation shift $f_{\Theta,O}$ will have access to exactly the
same knowledge that is available to $L'$. This is true
regardless of the architecture of $L'$.

4.3  Constructing  a  representation  shift 

Given  the  analysis  above,  it  is  natural  to  ask:  in  what 
ways  can  these  assumptions  about  EBG  be  relaxed? 
One  class  of  theory  that  violates  these  assumptions 
and  that  appears  to  occur  often  in  practice  is  the  class 
of  “abductive”  theories:  theories  that  make  assump¬ 
tions  in  the  process  of  theorem  proving.  These  theories 
often  violate  the  unique  proof  assumption.  A  common 
subclass  of  abductive  theories  is  the  class  of  theories 
that  involve  plan  recognition;  often,  any  of  several  as¬ 
sumptions  could  be  introduced  that  would  suffice  to 
explain  an  action,  but  only  explanations  based  on  the 
right assumption will be correct.

The  crucial  property  of  an  abductive  theory  seems  to 
be  that  for  every  true  observation,  there  is  some  cor¬ 
rect  explanation.  It  seems  reasonable  to  require  that 
a generalization associated with a valid explanation is
valid. This “abductive assumption” about the correctness
of EBG can be precisely stated as:

$$AB_{\Theta,O}(\mathcal{P}) \equiv \forall T \in \mathcal{P},\ \forall x \in T,\ \exists p_x \in \mathrm{PROOFS}_\Theta(x) : \mathrm{EBG}(\Theta, O, p_x) \subseteq T$$
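
The difference between this assumption and the stronger requirement that every proof of a positive example generalize validly can be seen on a small case. The sketch below is illustrative only: the instances, proofs and explanation structures are invented, and EBG is again defined through Assumption 1.

```python
# Hypothetical toy theory in which instance "a" has two competing explanations.
PROOFS = {"a": ["p1", "p2"], "b": ["p3"], "c": ["p4"]}          # instance -> proofs
STRUCTURE = {"p1": "e1", "p2": "e2", "p3": "e1", "p4": "e2"}    # proof -> e_O(proof)

def f_star(x):
    """Set of explanation structures over all proofs of x."""
    return frozenset(STRUCTURE[p] for p in PROOFS[x])

def ebg(p):
    """Generalisation of proof p, defined here through Assumption 1."""
    return {y for y in PROOFS if STRUCTURE[p] in f_star(y)}

def abductive(problems):
    """Each positive example has SOME proof whose generalisation is valid."""
    return all(any(ebg(p) <= t for p in PROOFS[x])
               for t in problems for x in t)

def all_valid(problems):
    """Each positive example has ONLY proofs whose generalisations are valid."""
    return all(ebg(p) <= t for t in problems for x in t for p in PROOFS[x])

print(abductive([{"a", "b"}]))   # True: p1 explains a correctly
print(all_valid([{"a", "b"}]))   # False: p2 generalises a to include c
print(abductive([{"b"}]))        # False: b's only proof also covers a
```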

A  reasonable  application  of  our  framework  would  be 
to  find  a  representation  shift  which  is  optimal  under 
this assumption: that is, a representation shift which
is KL-equivalent to $AB_{\Theta,O}(\mathcal{P})$. A learning algorithm
using this representation would have the advantages of
systems which use the representation shifting function
$f_{\Theta,O}$, but would be useful even when the unique proof
assumption  does  not  hold. 

In order to find a representation that is KL-equivalent
to this assumption, it appears necessary to
make some assumptions about $\mathcal{P}$ as well. Let us say
that a set $S \subseteq 2^X$ is a hitset iff $\exists S_0 \subseteq X : S = \{R :
\exists r \in R \cap S_0\}$. A class of learning problems $\mathcal{P}$ will be
said to have hitset images under $f_R$ iff $\forall T \in \mathcal{P}$, $f_R(T)$
is a hitset.
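
The hitset condition can be tested without enumerating every candidate $S_0$. The sketch below assumes that the sets $R$ range over a finite collection `candidates` (in the definition above they range over $2^X$); the function name and the data are illustrative. It relies on the observation that any workable $S_0$ must avoid every element of every set outside $S$, so it suffices to try the largest $S_0$ with that property.

```python
def is_hitset(S, candidates):
    """True if some S0 gives S == {R in candidates : R intersects S0}.

    S and candidates are collections of frozensets, with S drawn from candidates.
    """
    S = set(S)
    outside = [R for R in candidates if R not in S]
    universe = set().union(*candidates)
    # Largest S0 that misses every set outside S.
    s0 = universe - set().union(*outside)
    # S is a hitset exactly when every member of S still meets that S0.
    return all(R & s0 for R in S)

candidates = [frozenset({"e1"}), frozenset({"e2"}),
              frozenset({"e1", "e2"}), frozenset({"e3"})]

print(is_hitset([frozenset({"e1"}), frozenset({"e1", "e2"})], candidates))  # True
print(is_hitset([frozenset({"e1", "e2"})], candidates))                     # False
```

The first collection is the set of candidates that meet $S_0 = \{e1\}$. The second is not a hitset, because any $S_0$ meeting {e1, e2} also meets {e1} or {e2}.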

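Theorem 3 Let the function $f^*_{\Theta,O}(x)$ be defined as
$f^*_{\Theta,O}(x) = \{e_O(p_x) : p_x \in \mathrm{PROOFS}_\Theta(x)\}$. If $\mathcal{P}$ has
hitset images under $f^*_{\Theta,O}$, then $f^*_{\Theta,O}$ is KL-equivalent
to $AB_{\Theta,O}(\mathcal{P})$.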

Discussion of learning algorithms for the representation
space which is the range of $f^*$ is outside the scope
of this paper; however, the A-EBL technique described
in [Cohen, 1989; Cohen, 1990a; Cohen, 1990b] can be
viewed as such a learning algorithm.

A few remarks are in order about the theorem above,
in particular about the requirement that $\mathcal{P}$ have “hitset
images”. First, this property generally holds if $\mathcal{P}$
contains only concepts that can be expressed using a
set of rules produced by EBG; in particular, it can easily
be shown that if assumption 1 holds, then the image
under $f^*$ of the set $\mathrm{EBG}(\Theta, O, p_x)$ is always a hitset.

Second, the proof of the theorem shows that the
property of having hitset images is not necessary to
establish that

$$AB_{\Theta,O}(\mathcal{P}) \Rightarrow \mathrm{APPROP}(f^*_{\Theta,O}, \mathcal{P})$$

The property is only necessary for the other side of
the logical equivalence which is the definition of
KL-equivalence; that is, having hitset images is only
necessary to show that

$$\mathrm{APPROP}(f^*_{\Theta,O}, \mathcal{P}) \Rightarrow AB_{\Theta,O}(\mathcal{P})$$

In other words, the function $f^*$ will be an appropriate
representation shift whenever $AB(\mathcal{P})$ holds; the condition
that $\mathcal{P}$ has hitset images is only needed to show
that $f^*$ is the best representation shift.


²Of course, the assumption that $\mathcal{P}$ has hitset images
could  be  simply  added  to  AB(P).  This  presentation  em¬ 
phasizes  the  fact  that  this  assumption  is  motivated  by  the 
desire  to  construct  a  KL-equivalent  representation  shift 
rather  than  by  a  natural  assumption  about  the  learning 
problem. 


4.4  Discussion  of  the  results 

Any  system  which  uses  EBG  must  make  an  assump¬ 
tion  about  the  domain  theory  which  is  used;  the  as¬ 
sumption  that  is  commonly  associated  with  EBG  is 
that  the  domain  theory  is  a  complete,  correct,  and 
tractable  definition  of  the  target  concept.  In  recent 
years,  several  hybrid  learning  systems  have  been  built 
that  have  the  architecture  of  figure  1,  and  which  use  as 
a  representation-shifting  function  a  mapping  from  in¬ 
stances  to  the  explanation  structures  associated  with 
those  instances.  Presumably,  these  systems  make  a 
weaker  assumption  about  the  correctness  of  the  do¬ 
main  theory. 

The  analysis  above  confirms  this  intuition.  It  shows 
that  the  domain  theory  used  by  such  a  learning  sys¬ 
tem  must  satisfy  two  conditions:  every  instance  must 
have  exactly  one  associated  explanation  structure,  and 
EBG  must  never  overgeneralize  a  positive  example. 
This is a strong assumption, but it does not require
the domain theory to be complete and correct; in particular,
the domain theory might produce an (incorrect)
proof for a negative example of the concept. In
other words, the domain theory can be overly general
with respect to the target concept. However, the theory
must be sufficiently detailed that EBG does not
overgeneralize any positive examples.

The  second  result  shows  that  some  of  the  require¬ 
ments  on  the  domain  theory  can  be  relaxed  by  using 
a  slightly  different  abstraction  function.  By  mapping 
an instance to a set of explanation structures, rather
than  to  a  single  explanation  structure,  a  somewhat 
weaker  requirement  is  placed  on  the  domain  theory. 
The  domain  theory  must  still  be  complete;  however, 
the  theory  may  produce  multiple  inconsistent  explana¬ 
tion  structures  for  each  instance,  rather  than  a  single 
explanation  structure.  The  theory  must  still  be  suffi¬ 
ciently  detailed  that  EBG  does  not  overgeneralize  on 
all  explanations  of  a  positive  example;  however,  EBG 
may  overgeneralize  on  some  of  the  possible  explana¬ 
tions  of  an  instance. 

The  second  result  is  also  of  interest  because  it  shows 
that  it  is  possible  to  relax  assumptions  about  the  do¬ 
main  theory  in  a  principled  way,  and  thus  to  expand 
the  range  of  competence  of  a  representation-shifting 
learning  system. 

5  Conclusions 

In this paper, we have developed a framework for
knowledge-level analysis of the changes in representations
of training examples in learning. There have been
several learning systems implemented that use changes
of representation of the kind considered in this paper;
some good examples are [Cohen, 1990b] and the systems
discussed in [Flann and Dietterich, 1989].³ The
framework was used to uncover the assumptions implicit
in a commonly-used representation shift, and to
derive a representation shift which is appropriate for
a weak assumption about the correctness of EBL. The
first result is important because it helps us to understand
the assumptions made by the learning systems
which use this representation shift; the second, because
it shows how to relax these assumptions in a principled
way.

³It is important to keep in mind the difference between
these systems and work on constructive induction,
for example [Drastal et al., 1989]: constructive induction
is concerned with information-preserving transformations
of training instances.

Most  prior  analyses  of  representation  change  have 
focused  on  automatic  reformulation  of  problem-solving 
strategies  [Amarel,  1981;  Korf,  1981:  Riddle,  logoj 
Subramanian,  ISSSj.  Some  of  the  formal  models  of 
representation  change  in  problem  solving,  however,  are 
general  enough  to  serve  as  a  model  for  representation 
change  in  learning.  In  particular,  our  model  of  repre¬ 
sentation  change  in  learning  and  notion  of  appropri¬ 
ateness  is  closely  related  to  the  models  presented  in 
[Holte and Zimmer, 1989] and [Lowry, 1988]; in fact,
it  is  a  special  case,  in  which  the  problem  to  be  solved  is 
required  to  be  a  learning  problem.  However,  the  earlier 
models do not develop the notion of a “most appropriate”
representation, and have not been applied to
analysis  of  representation  shifts  in  learning. 

Work  in  representation  change  in  learning  has  gen¬ 
erally  focused  on  automatic  change  or  derivation  of 
inductive bias [Chrisman, 1989; Keller, 1989; Russell
and Grosof, 1989; Utgoff, 1984]. The representation
changes considered in this paper (changes in the
representation of training examples) are more fundamental
than shift of bias; such a change alters the very
space over which learning occurs, and hence necessitates
selection of a new hypothesis space and (probably)
a new learning algorithm. Our goals, however,
are  more  modest:  the  goal  of  this  paper  is  to  provide 
a  framework  for  analysis  of  representational  choices 
made  by  humans,  and  to  suggest  a  methodology  for 
humans  to  follow  in  constructing  new  representational 
shifts,  given  background  knowledge  about  a  learning 
problem. 

These goals have been at least partially met; however,
many problems remain for future research. One
difficult problem is automation (or partial automation)
of the methodology suggested for constructing
representational shifts. Another set of problems is suggested
by the observation that there is a second possible
interpretation of the architecture of figure 1:
$X_I$ might be the space in which training examples are
“naturally found” (for instance, $X_I$ might be the set
of all soybean plants). In this case, $f_R$ is a noncomputable
function that denotes an initial choice of representations.
Our basic framework applies to initial-choice
representations as well as to representation-shifting
functions, although obviously proof techniques
for  showing  the  appropriateness  of  initial-choice  repre¬ 
sentations  will  be  different  from  the  proof  techniques 
used  in  this  paper.  Finally,  the  definition  of  the  ap¬ 
propriateness  of  a  representational  shift  assumes  that 
the  goal  of  learning  is  exact  identification  of  the  target 
concept;  this  is  a  more  stringent  identification  criterion 
than the Valiant criterion of probably approximately
correct learnability [Valiant, 1984], which requires only
approximate  identification  of  the  target  concept.  This 
suggests  that  a  probabilistic  extension  of  the  analytic 
framework  may  be  worth  investigating. 
A  Proofs  of  theorems

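A.1 Proof of theorem 1

If $\chi_T(x)$ is the characteristic function of $T$ on $x$ then
it can be easily verified that

$$\mathrm{APPROP}(f_R, \mathcal{P}) \equiv \forall T \in \mathcal{P},\ \forall x, y \in X_I,\ \big(f_R(x) = f_R(y) \Rightarrow \chi_T(x) = \chi_T(y)\big) \qquad (3)$$

So if

$$A(\mathcal{P}) \Leftrightarrow \mathrm{APPROP}(f_{R_1}, \mathcal{P}) \Leftrightarrow \mathrm{APPROP}(f_{R_2}, \mathcal{P})$$

then for $i = 1, 2$

$$\forall x, y \in X_I,\ f_{R_i}(x) = f_{R_i}(y) \Leftrightarrow \big(\forall T \in \mathcal{P},\ \chi_T(x) = \chi_T(y)\big)$$

and hence

$$f_{R_1}(x) = f_{R_1}(y) \Leftrightarrow f_{R_2}(x) = f_{R_2}(y)$$

and the mapping $g(x) = f_{R_1}(f_{R_2}^{-1}(x))$ is well-defined
and one-to-one. ■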

A.2 Proof of theorem 2

To prove the theorem, it is necessary to show both
sides of the logical equivalence. The proposed simplification
of $\mathrm{APPROP}(f_{\Theta,O}, \mathcal{P})$ will be referred to as
$A(\mathcal{P})$.

To show that ($f_{\Theta,O}$ is appropriate for $\mathcal{P}$) $\Rightarrow A(\mathcal{P})$.
First, note that if $f_{\Theta,O}$ is a well-defined function over
$X_I$, then the unique proof assumption holds. If $f_{\Theta,O}$
is also appropriate for $\mathcal{P}$, then

$$\forall T \in \mathcal{P},\ \forall x, y,\ \neg\big(f_{\Theta,O}(x) = f_{\Theta,O}(y) \wedge \chi_T(x) \neq \chi_T(y)\big)$$

again using $\chi_T(x)$ to denote the characteristic function
of $T$ on $x$. Now, if $f_{\Theta,O}(x) = f_{\Theta,O}(y)$ then there
is some explanation structure $e$ such that $f_{\Theta,O}(x) =
f_{\Theta,O}(y) = e$, and

$$\forall T \in \mathcal{P},\ \forall x \in T,\ \big(e_O(p_x) = e_O(p_y) \Rightarrow y \in T\big)$$
$$\Rightarrow \forall T \in \mathcal{P},\ \forall x \in T,\ \big(y \in \mathrm{EBG}(\Theta, O, p_x) \Rightarrow y \in T\big) \qquad (5)$$
$$\Rightarrow \forall T \in \mathcal{P},\ \forall x \in T,\ \mathrm{EBG}(\Theta, O, p_x) \subseteq T$$

where $p_x$ is any proof of $x$, and where $p_y$ is any proof
of $y$; hence the second conjunct holds. Line 5 follows
from assumption 1.

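To show that $A(\mathcal{P}) \Rightarrow$ ($f_{\Theta,O}$ is appropriate for $\mathcal{P}$).
First, note that the unique proof assumption implies
that $f_{\Theta,O}$ is a function. Second, observe that for any
$\mathcal{P}$,

$$\forall T \in \mathcal{P},\ \forall x \in T,\ \big(\mathrm{EBG}(\Theta, O, p_x) \subseteq T\big)$$
$$\Rightarrow \forall T \in \mathcal{P},\ \forall x \notin T,\ \big(\mathrm{EBG}(\Theta, O, p_x) \subseteq \bar{T}\big)$$

where $p_x$ again denotes a proof of $x$. To see that this
is true, imagine that the antecedent holds but

$$\exists T \in \mathcal{P},\ x \notin T,\ p_x \in \mathrm{PROOFS}_\Theta(x) : \mathrm{EBG}(\Theta, O, p_x) \not\subseteq \bar{T}$$

Then there is some $y \in \mathrm{EBG}(\Theta, O, p_x) - \bar{T}$; note that
this implies that $y$ is in $T$. But by assumption 1

$$y \in \mathrm{EBG}(\Theta, O, p_x)$$
$$\Rightarrow e_O(p_y) = e_O(p_x)$$
$$\Rightarrow \mathrm{EBG}(\Theta, O, p_y) = \mathrm{EBG}(\Theta, O, p_x)$$

and so

$$\exists T \in \mathcal{P},\ y \in T : \mathrm{EBG}(\Theta, O, p_y) \not\subseteq T$$

a contradiction. The preceding argument means that
$A(\mathcal{P})$ implies

$$\forall T \in \mathcal{P},\ \forall x, y \in X_I,\ \big(y \in \mathrm{EBG}(\Theta, O, p_x) \Rightarrow \chi_T(x) = \chi_T(y)\big)$$

Since by assumption 1

$$y \in \mathrm{EBG}(\Theta, O, p_x)$$
$$\Leftrightarrow e_O(p_y) = e_O(p_x)$$
$$\Leftrightarrow f_{\Theta,O}(x) = f_{\Theta,O}(y)$$

it follows that

$$\forall T \in \mathcal{P},\ \forall x, y,\ \big(f_{\Theta,O}(x) = f_{\Theta,O}(y) \Rightarrow \chi_T(x) = \chi_T(y)\big)$$

Hence $f_{\Theta,O}$ is appropriate by equation 3. ■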

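A.3 Proof of theorem 3

Again, both sides of the equivalence must be shown.

To show $AB(\mathcal{P}) \Rightarrow f^*_{\Theta,O}$ is appropriate. The contrapositive
will be shown. If $f^*_{\Theta,O}$ is not appropriate,
then

$$\exists T \in \mathcal{P},\ x, y : f^*_{\Theta,O}(x) = f^*_{\Theta,O}(y) \wedge \chi_T(x) \neq \chi_T(y)$$

Assume without loss of generality that $x \in T$ and $y \notin T$.
Now, by assumption 1, if $f^*_{\Theta,O}(x) = f^*_{\Theta,O}(y)$ then
$\forall p_x \in \mathrm{PROOFS}(x),\ y \in \mathrm{EBG}(\Theta, O, p_x)$. But then

$$\exists T \in \mathcal{P},\ x \in T,\ y \notin T : \forall p_x \in \mathrm{PROOFS}(x),\ y \in \mathrm{EBG}(\Theta, O, p_x)$$

contradicting $AB(\mathcal{P})$.

To show $f^*_{\Theta,O}$ is appropriate $\Rightarrow AB(\mathcal{P})$. This direction
of the proof uses the easily-verified proposition
that for any appropriate representation $f_R$ for $\mathcal{P}$

$$\forall T \in \mathcal{P},\ \forall x \in X_I,\ \big(x \in T \Leftrightarrow f_R(x) \in f_R(T)\big)$$

If $f^*_{\Theta,O}$ is appropriate for $\mathcal{P}$ then

$$\forall T \in \mathcal{P},\ \forall x \in X_I,\ \big(x \in T \Rightarrow f^*_{\Theta,O}(x) \in f^*_{\Theta,O}(T)\big)$$

Now, since $f^*_{\Theta,O}(T)$ is a hitset, there is some $E_0$ such
that $f^*_{\Theta,O}(T) = \{E : \exists e_O(p_{x,i}) \in E \cap E_0\}$ and so

$$f^*_{\Theta,O}(x) \in f^*_{\Theta,O}(T) \Rightarrow \exists e_O(p_{x,i}) \in f^*_{\Theta,O}(x) \cap E_0$$

Let $e'$ denote the explanation structure in $f^*_{\Theta,O}(x) \cap
E_0$. We claim that any $y$ that has a proof $p_y$ with an
explanation structure equivalent to $e'$ will be in $T$. If
this is true, then by assumption 1, $\mathrm{EBG}(\Theta, O, p_{x,i}) \subseteq T$
for the proof $p_{x,i}$ of $x$ with explanation structure $e'$;
and since nothing has been assumed except that $x \in T$
and $T \in \mathcal{P}$, $AB(\mathcal{P})$ will be established and the proof
will be complete.

But the claim must be true, because for any $y$,

$$e' \in f^*_{\Theta,O}(y)$$
$$\Rightarrow e' \in f^*_{\Theta,O}(y) \cap E_0$$
$$\Rightarrow f^*_{\Theta,O}(y) \in f^*_{\Theta,O}(T)$$
$$\Rightarrow y \in T$$

the last step following because $f^*_{\Theta,O}$ is appropriate. ■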
Abstract

Induction  of  action  sequences  in  a  simulated 
robot  world  is  described.  The  learner  pas¬ 
sively  observes  examples  which  are  procedu¬ 
ral  action  sequences  where  properties  and/or 
relations  change  in  the  simulation.  The  ob¬ 
served  procedures  are  split  and  trimmed  to 
form  the  initial  concept  descriptions.  Active 
experimentation  occurs  when  the  system  re¬ 
peats  or  tests  more  general  descriptions  of 
the  observed  examples.  Generalisations  are 
formed  by  constructive  induction  using  in¬ 
verse  resolution  and  tested  by  execution  of 
operational  actions  in  the  simulation.  Com¬ 
pleted  execution  of  the  test  confirms  success. 
Inability  to  complete  a  concept  in  the  sim¬ 
ulation  causes  search  for  successively  greater 
generalisation on demand in order to continue
execution.  Background  knowledge  need  not 
be  present  as  the  system  invents  appropri¬ 
ately  justified  concepts  as  required. 

1  Introduction 

Before  children  learn  conventional  language  they  can 
learn  how  to  build  columns  and  arches  from  blocks, 
and  other  such  concepts.  They  do  this  by  observing  an 
agent (maybe a parent) perform one or more examples
of these concepts, and then imitating the observed
actions.  Once  they  start  experimenting  they  can  some¬ 
times  proceed  and  successfully  learn  the  intended  con¬ 
cept,  or  some  variation  on  it,  without  feedback  from 
the  agent  (“playing  by  themselves”). 

In  a  simulated  world  containing  two  robots  we  ex¬ 
amine  this  type  of  learning  from  the  point  of  view  of 
the child. Thus the learning task is: given an intelligent
agent performing examples of purposeful action
sequences, create a learning element which observes
these examples and then conducts experiments to acquire
useful concepts by exploring the environment.

The learning of procedures from examples in a reactive
environment has received some attention in
NODDY {Andreae, 1985} (learns blocks world actions),
SIERRA {VanLehn, 1987} (learns how to subtract)
and BAGGER2 {Shavlik, 1989} (learns a wide
variety of concepts: build column, transform logic circuit,
etc.). However, each acquires concepts by passively
observing examples without generating experiments.
Active learning of single step procedures occurs
cooperatively in ADEPT (experimentation) and
PHINEAS (concept revision) {Falkenhainer & Rajamoney,
1988} with explanations of new physical processes.
But in all these systems, as for EBL in general,
extensive domain theory must be supplied or built in
so that explanations of examples may be analysed.

In  planning  or  problem  solving,  two  learning  sys¬ 
tems  which  can  learn  temporal  concepts  interactively 
and  autonomously  in  an  environment  are  PRODIGY 
{Carbonell  &  Gil,  1988}  and  LIVE  {Shen  &  Simon, 
1989}.  PRODIGY  discovers  missing  conditions  on 
operators  as  well  as  their  subgoal  ordering  of  appli¬ 
cation,  and  LIVE  can  extend  its  concept  description 
(by  matching  domain  functions  over  the  whole  exper¬ 
iment  in  order  to  find  missing  aspects  of  the  concept 
description)  when  the  operators  fail  to  predict  an  out¬ 
come,  thus  causing  the  operator  to  split  on  the  pre¬ 
condition.  However,  both  these  systems  obtain  their 
guidance  from  a  predefined  goal  to  be  obtained,  and 
other  domain  knowledge. 

In  this  paper  we  describe  learning  of  a  set  of  con¬ 
cepts  representing  a  procedure.  This  set  of  concepts 
is  organised  as  a  hierarchy  with  higher  level  concepts 
referring  to  lower  level  concepts.  For  each  new  concept 
identifier (supplied with examples) a hierarchy is built,
and  when  learnt  can  be  used  to  recognise  an  example 
of  that  procedure,  or  can  be  executed  to  perform  that 
procedure.  The  seed  concept  is  the  unique  concept  at 
the  very  top  of  a  hierarchy.  Constructive  induction  is 
performed  on  different  parts  of  the  hierarchy  during  an 
experiment  to  perform  and  track  execution  of  the  seed 
concept.  Using  only  imitation  and  the  environment  as 
guiding  influences,  we  show  that  effective  learning  in 
the  absence  of  background  knowledge  can  occur. 

After  a  description  of  the  representation  and  termi¬ 
nology  used,  the  overall  learning  strategy  is  described. 
Then, details of tracking experiments on concepts, and
regeneralisation on failure of expected outcomes, are followed
by the generalisation mechanism. After this, two
traces  of  concepts  learnt  by  CAP  are  examined,  fol¬ 
lowed  by  discussion  of  the  system  learning  these  and 
other  procedures. 
2  Learning  Procedures

The representation used for concepts is a subset of
Horn clause logic¹ with the following variations. For
the purpose of describing temporal concepts it is useful
to distinguish two semantic interpretations of predicates:
state predicates p(t1, ..., tn, Time, Time) indicate
that p(t1, ..., tn) is true at state Time (shown
by p(t1, ..., tn) at Time), and action predicates
p(t1, ..., tn, Start, Finish) indicate that p(t1, ..., tn)
is true during the states Start until Finish inclusive
(shown by p(t1, ..., tn) during Start/Finish). Possible
interpretations for A ← B, C are “if B and C
are recognised, then we recognise A as well” and “to
perform A then one must perform B and C”.

The components of the system and the information
flow between them are shown in Figure 1.
CAP is connected to a simulation which receives primitive
actions to be performed and produces predicates
describing properties, relations, and actions for the
next state. The system starts by observing (i.e. recording)
all primitive predicates produced by the connected
simulation during execution of an example provided by
an external agent. This is specified as a list of actions
on  objects,  and  is  performed  by  the  learner  in  the  sim¬ 
ulation.  Neither  the  substructure  of  this  example  nor 
the  objects  relevant  to  the  intended  concept  to  be  ac¬ 
quired  are  given,  merely  an  observed  sequence  of  states 
and  actions  with  possibly  many  irrelevant  objects.  The 
intelligent  agent  that  provided  the  example  can  never 
give  feedback  to  the  learner  as  to  the  nearness  to  the 
intended  target  concept.  Thus  the  example  is  referred 
to  as  an  instance  of  a  seed  concept,  ie.  one  to  stimulate 
and  direct  search  for  useful  concepts. 

In order to create the initial procedure description,
the complete observed example trace is split at two
levels of division and “irrelevant” predicates are trimmed
out. A concept hierarchy with one “root” is then generated
with all constants automatically variablised. As
this paper is concerned with active experimentation,
we only briefly sketch this process.

Major  divisions  occur  when  primitive  actions  cease 
acting  on  one  object  set  and  start  acting  on  a  different 
set  of  objects.  Minor  divisions  occur  when  one  primi¬ 
tive  action  ceases  on  an  object  set  and  a  different  prim¬ 
itive  action  starts  on  the  same  object  set.  Trimming 
chooses  the  most  changing  object  and  objects  from 
one  or  two  next  level  sets  of  object-changing-predicate 
counts.  This  is  a  generalisation  as  conjoined  predi¬ 
cates  are  removed  thus  increasing  the  coverage  or  use 
of  the  resulting  description.  The  justification  for  this 
comes  from  a  heuristic  of  “localised  focussing”.  Just  as 
a  child  focusses  on  a  subset  of  all  the  objects  available 
in  order  to  build  an  arch,  so  does  CAP. 
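
The two levels of division can be sketched as follows. This is a minimal reading of the paragraph above, not CAP's code: the function name `divide` and the representation of a trace as (action name, object set) pairs are assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, List, Tuple

Step = Tuple[str, FrozenSet[str]]  # (primitive action name, objects acted on)


def divide(steps: Iterable[Tuple[str, Iterable[str]]]) -> List[List[List[Step]]]:
    """Split a trace into major divisions, each a list of minor divisions."""
    majors: List[List[List[Step]]] = []
    for name, objs in steps:
        step = (name, frozenset(objs))
        if not majors or majors[-1][-1][-1][1] != step[1]:
            # Actions move to a different object set: start a major division.
            majors.append([[step]])
        elif majors[-1][-1][-1][0] != name:
            # Same object set, different primitive action: start a minor division.
            majors[-1].append([step])
        else:
            # Same action on the same objects: extend the current minor division.
            majors[-1][-1].append(step)
    return majors


trace = [("climbup", {"a"}), ("climbup", {"a"}),
         ("climbup", {"b"}), ("climbdown", {"b"})]
for major in divide(trace):
    print([[name for name, _ in minor] for minor in major])
```

The toy trace uses action names from the mountaineering domain described later; it is not one of the paper's recorded examples.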

Example  1 

For a seed called do-a-pour the observed sequence is
pouring water from a full cup to an empty cup:

cup(a) at 1, cup(b) at 1, cylinder(c) at 1, bowl(e) at 1,
contains-liquid(a) at 1, not-contains-liquid(b) at 1,
not-contains-liquid(c) at 1, not-contains-liquid(e) at 1,
pour(a, b) during 1 / 2,
cup(a) at 2, cup(b) at 2, cylinder(c) at 2, bowl(e) at 2,
not-contains-liquid(a) at 2, contains-liquid(b) at 2,
not-contains-liquid(c) at 2, not-contains-liquid(e) at 2

¹ The reader is referred to {Muggleton & Buntine, 1988}.

The trivial seed concept hierarchy from this, prior to
all constant terms being replaced by variable terms, is:

Sorted count list [(b , 3), (a , 3)]
Trimming to objects: [b, a]. Each state trimmed from 8 to 4
do-a-pour(b, a) during 1 / 2 :-
    sdo-a-pour(b, a) during 1 / 2
sdo-a-pour(b, a) during 1 / 2 :-
    ssdo-a-pour(b, a) during 1 / 2
ssdo-a-pour(b, a) during 1 / 2 :-
    cup(a) at 1,
    cup(b) at 1,
    contains-liquid(a) at 1,
    not-contains-liquid(b) at 1,
    pour(a, b) during 1 / 2,
    contains-liquid(b) at 2,
    not-contains-liquid(a) at 2,
    cup(a) at 2,
    cup(b) at 2
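
The trimming step in Example 1 can be approximated by the sketch below. The counting rule is an assumption, not taken from the paper: an object scores one for each state predicate about it that appears or disappears between the two states, plus one for each action that mentions it. That rule happens to give the counts of 3 for a and b shown in the sorted count list; the function names are hypothetical.

```python
from collections import Counter

# State predicates as (name, object) pairs, copied from Example 1.
state1 = {("cup", "a"), ("cup", "b"), ("cylinder", "c"), ("bowl", "e"),
          ("contains-liquid", "a"), ("not-contains-liquid", "b"),
          ("not-contains-liquid", "c"), ("not-contains-liquid", "e")}
state2 = {("cup", "a"), ("cup", "b"), ("cylinder", "c"), ("bowl", "e"),
          ("not-contains-liquid", "a"), ("contains-liquid", "b"),
          ("not-contains-liquid", "c"), ("not-contains-liquid", "e")}
actions = [("pour", ("a", "b"))]


def change_counts(before, after, acts):
    """Count changing predicates per object (assumed scoring rule)."""
    counts = Counter()
    for _name, obj in before ^ after:      # predicates present in only one state
        counts[obj] += 1
    for _name, objs in acts:               # each action counts for its objects
        for obj in objs:
            counts[obj] += 1
    return counts


def trim(state, keep):
    """Keep only the predicates that mention a retained object."""
    return {pred for pred in state if pred[1] in keep}


counts = change_counts(state1, state2, actions)
keep = {obj for obj, n in counts.items() if n == max(counts.values())}
print(sorted(counts.items()))
print(len(state1), "->", len(trim(state1, keep)))
```

Keeping only the top-scoring objects is a simplification of the selection described in the text, which also admits objects from one or two of the next count levels.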

If at some later stage (maybe after some learning)
more examples are observed, an attempt is made to
recognise them with the current hierarchy. From this,
intra-construction forms a new ‘root’ based on a simplified
maximal subsequence match².

A cautious learner in an unknown environment will
not make the most radical changes to its current concept
description and hope for success. Identification
of the cause of a failure is easier with smaller numbers
of simultaneously tested changes. During execution
of a concept hierarchy many concept bodies are being
executed. With generalisations possible on all these
bodies the amount attempted must be strictly controlled.
To this end generalisation levels are defined,
ordered on the least amount of new coverage added to
memory, and on minimal invention of new symbols.

After  the  examples  are  read  in  to  form  the  initial 
hierarchy  the  system  attempts  to  learn  a  more  general 
hierarchy  using  Algorithm  1.  But  first  the  seed  con¬ 
cepts  are  assigned  initial  generalisation  levels  of  repeat 
which  force  an  initial  execution  of  the  concept  without 
attempting  any  generalisations. 

Algorithm  1  Learn 

1. Find a seed concept S ← B that has not stopped
testing, with the lowest generalisation level L. If
there is more than one seed concept at this level
then choose the one with the least recent experiment.
Bind the starting state term of S to the
current simulation state number.

2. Call track(S, L) which executes the whole procedure
(ie. concept hierarchy) by executing a body
of S which in turn executes each of the components
of that body and so on. The generalisation
level L represents the maximum generalisation to
be attempted on any body executed.

3. If track exits then S was performed successfully, so
the generalisations performed on any part of the
hierarchy during execution (if any) are adopted.
If no generalisation was tested then the learner
tries being more adventurous by increasing the
generalisation level of this seed concept. If any
generalisation required invention of a new concept
then restart any “stopped testing” seed concepts.
Continue at Step 1.

4. If track fails and the mistrack occurred in the very
last state, then we can be sure of exactly what
generalisation on a concept body caused the failure.
Record this as a failed generalisation (a generalisation
never to be attempted again) in that
concept. Continue at Step 6.

5. Otherwise, the track failed and we cannot identify
the exact cause. Mistracking could not be corrected
by the generalisation level so regeneralise
by temporarily increasing the generalisation level³
and continuing at Step 2 (with original starting
state number). If we cannot temporarily increase
the level then proceed at Step 6.

6. Because the track has failed, the learner tries
being lucky (or more desperate), looking for a
more general concept that may be testable in the
world, by increasing the generalisation level. But
if there are no more generalisation levels then
experimentation stops (marking this seed concept as
“stopped testing”) until new concepts arise due to
other seed concepts. Continue at Step 1.

² These aspects of example observation along with other
details are described in detail in {Hume, 1990}.

Figure 1: System Diagram and Information Flow
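
The control flow of Algorithm 1 can be sketched as a small scheduling loop. This is a reading of the six numbered steps only, under several assumptions: `Seed`, `Outcome`, `raise_level` and `learn` are hypothetical names, `track` is supplied by the caller as a stand-in for Algorithm 2, and the level resets that appear in the worked session later in the paper are not modelled.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

LEVELS = ["repeat", "absorb", "intra-construct"]  # ordering given in the paper


@dataclass
class Seed:
    name: str
    level: int = 0            # index into LEVELS
    stopped: bool = False     # "stopped testing" marker
    last_run: int = -1        # clock of the most recent experiment
    failed: List[object] = field(default_factory=list)


@dataclass
class Outcome:
    success: bool
    generalisation: Optional[object] = None  # what was tested, if anything
    invented: bool = False                   # did it need a new concept?
    failed_in_last_state: bool = False


def raise_level(seed: Seed) -> None:
    """Increase the level, or stop testing when no level remains."""
    if seed.level + 1 < len(LEVELS):
        seed.level += 1
    else:
        seed.stopped = True


def learn(seeds: List[Seed], track: Callable[[Seed, str], Outcome],
          max_experiments: int = 20) -> List[Tuple[str, str, bool]]:
    log = []
    for clock in range(max_experiments):
        active = [s for s in seeds if not s.stopped]
        if not active:
            break
        # Step 1: lowest level first, then least recent experiment.
        seed = min(active, key=lambda s: (s.level, s.last_run))
        seed.last_run = clock
        level = seed.level
        out = track(seed, LEVELS[level])                      # Step 2
        while (not out.success and not out.failed_in_last_state
               and level + 1 < len(LEVELS)):
            level += 1                                        # Step 5: temporary increase
            out = track(seed, LEVELS[level])
        log.append((seed.name, LEVELS[level], out.success))
        if out.success:                                       # Step 3
            if out.generalisation is None:
                raise_level(seed)
            elif out.invented:
                for s in seeds:
                    s.stopped = False
        else:
            if out.failed_in_last_state and out.generalisation is not None:
                seed.failed.append(out.generalisation)        # Step 4
            raise_level(seed)                                 # Step 6
    return log


seeds = [Seed("do-a-pour"), Seed("do-a-fillfromtap")]
for entry in learn(seeds, lambda seed, level: Outcome(success=True)):
    print(entry)
```

With a stub `track` that always succeeds without testing anything, the two seeds alternate and climb through the levels until both are marked as stopped.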

Failure  of  experiments  is  critical  to  specific-to- 
general  learning,  otherwise  the  concepts  would  never 
stop  being  generalised.  In  Step  4  of  the  above  algo¬ 
rithm  the  experiment  fails  in  the  very  last  state  of  the 
seed  concept.  If  no  generalisation  was  tested  (as  in  a 
repeat  experiment)  then  the  existing  concept  descrip¬ 
tion  is  incorrect  and  thus  a  previous  generalisation  was 
wrong. Because each of the current generalisations on a
concept was applied and tested at a different time,
identifying the exact generalisation that caused this
failure is impossible. Thus all generalisations ever
conducted on the concept are undone and the conjunction
of them all is recorded as a failed generalisation.

Similarly, when a current generalisation being tested
is deemed to be responsible for the failure, it is the
conjunction of all the generalisations that has failed
and is not allowed to be tested again.
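
A minimal sketch of this bookkeeping, assuming generalisations can be named by strings and a failure is stored as the set of everything that was current at the time. The function names are hypothetical; the concept names are borrowed from the water session later in the paper.

```python
def record_failure(failed: set, current) -> None:
    """Store the conjunction of all current generalisations as one failure."""
    failed.add(frozenset(current))


def allowed(failed: set, existing, proposed) -> bool:
    """A new test is allowed only if existing + proposed is not a known failure."""
    return (frozenset(existing) | frozenset(proposed)) not in failed


failed = set()
record_failure(failed, {"maybe-contains-liquid", "cuporcylinder"})

print(allowed(failed, {"maybe-contains-liquid"}, {"cuporcylinder"}))
print(allowed(failed, {"maybe-contains-liquid"}, {"concave-up"}))
```

Storing the whole conjunction rather than a single generalisation means each member may still be tried in other combinations, which is what lets a different combination be tested after one has failed.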


³ The ordering is repeat, absorb and intra-construct.


Tracking a predicate attempts to complete execution
of  that  predicate  by  either  showing  that  it  exists  in  the 
world  (for  primitive  state  predicates),  by  performing 
that  predicate  in  the  simulation  (for  primitive  action 
predicates),  or  by  finding  the  body  of  that  predicate 
from  the  concept  hierarchy  and  executing  that  body 
or  an  attempted  generalisation  of  that  body: 

Given  a  partially  grounded  predicate  P  and  a  re¬ 
quested  generalisation  level  L,  we  generalise  and  track 
the  predicate  using  Algorithm  2: 

Algorithm  2  Track(P,  L) 

• If P is primitive then completed execution occurs
if P ∈ world W, or P can be, and is, performed
in the simulation. (The performance will add P
and the simulation’s succeeding state predicates
to W.) Fail otherwise.

• For non-primitive P, attempt to find an experiment
test, T, for a generalisation of the requested
level, generalise(L, P, T) (described in
Algorithm 3, below), or fail.

With the returned experiment body T, track each
of the component predicates P_i ∈ T, in order:
track(P_i, L). Succeed if all components complete
execution, otherwise backtrack to generalise(L,
P, T), to possibly produce a new experiment the
components of which may succeed.
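
The recursive shape of Algorithm 2 can be sketched as below. This is a heavily simplified illustration, not CAP's implementation: predicates are ground strings, variables and time stamps are ignored, the generalisation level is omitted, and an ordered list of candidate bodies per concept stands in for the experiments that generalise(L, P, T) would return on backtracking. The `perform` stub is a hypothetical simulation.

```python
from typing import Callable, Dict, List, Set

ACTIONS = {"pour(a,b)"}  # primitive actions the toy simulation can perform


def perform(pred: str, world: Set[str]) -> bool:
    """Toy simulation: performing an action adds it to the world."""
    if pred in ACTIONS:
        world.add(pred)
        return True
    return False


def track(pred: str, world: Set[str], concepts: Dict[str, List[List[str]]],
          do: Callable[[str, Set[str]], bool]) -> bool:
    """Return True if pred completes execution."""
    if pred not in concepts:
        # Primitive: either already true in the world or performable.
        return pred in world or do(pred, world)
    for body in concepts[pred]:
        # Track each component in order; all() stops at the first failure.
        if all(track(p, world, concepts, do) for p in body):
            return True
        # Otherwise fall through to the next candidate experiment body.
    return False


world = {"cup(a)", "contains-liquid(a)"}
concepts = {"do-a-pour": [["bowl(a)", "pour(a,b)"],
                          ["cup(a)", "contains-liquid(a)", "pour(a,b)"]]}
print(track("do-a-pour", world, concepts, perform))
print(sorted(world))
```

In the toy run the first body fails on bowl(a), so the second body is tried and the pour is performed. Actions already performed in a real simulation cannot be undone on backtracking, which this sketch does not address.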

When  a  concept  learner  has  too  specific  a  concept 
description, it must generalise that description to cover
more  positive  examples  while  still  excluding  negative 
examples.  We  have  just  seen  how  a  test  of  the  new  de¬ 
scription  created  is  verified.  Now,  we  describe  how  to 
create  experiments  and  how  to  repair  ones  that  fail,  so 
that  execution  of  the  overall  hierarchy  may  continue. 

Given  a  requested  generalisation  level  L  and  a  par¬ 
tially  grounded  predicate  P,  we  specify  how  to  return 
a  set  of  predicates  T,  representing  a  test  of  a  general¬ 
isation  of  P  with  Algorithm  3: 

Algorithm 3 Generalise(L, P, T)

1. Since P is guaranteed to be a non-primitive predicate,
find a matching concept P ← C in memory.

2. Find an experiment of the requested generalisation
level L, for P on C, returning the test of that
experiment, T, and a set of generalisations, G, being
tested, ensuring that the union of G and the
existing generalisation set on C is not in the failed
generalisations of C. Failing this, attempt to find
the most minimal generalisation experiment possible,
up to level L. Otherwise try to find another
body C in Step 1 and continue there, or fail.

3. If the experiment test, T, fails to track then
regeneralise by trying to find a different experiment
test, T', for a consistent generalisation on C. If
this cannot be found, try to find another body C
in Step 1 and continue there, or fail.

The key to determining the success of experiments of
generalisation tests is a consistent generalisation. A
consistent  generalisation  exists  on  a  concept  body,  C, 
when  the  generalisation  list,  GL,  for  C,  containing  suc¬ 
cessively  applied  generalisation  sets,  has  each  succeed¬ 
ing  generalisation  set  a  time  invariant  subset  (a  subset 
up  to  renaming  of  time  stamps),  of  the  preceding  gen¬ 
eralisation  set  in  the  list. 
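
The definition above can be checked mechanically. The sketch below is a simplification under stated assumptions: generalisations are written as strings in the paper's trace notation, and “subset up to renaming of time stamps” is approximated by deleting the time stamps altogether before comparing. The function names are hypothetical.

```python
import re
from typing import Iterable, List

# Matches " at N" and " during N / M" stamps in the trace notation.
_STAMP = re.compile(r"\s+(at\s+\w+|during\s+\w+\s*/\s*\w+)")


def strip_time(clause: str) -> str:
    """Remove time stamps so clauses can be compared time-invariantly."""
    return _STAMP.sub("", clause)


def is_consistent(generalisation_list: List[Iterable[str]]) -> bool:
    """Each succeeding set must be a time-invariant subset of its predecessor."""
    stripped = [frozenset(strip_time(c) for c in g) for g in generalisation_list]
    return all(later <= earlier for earlier, later in zip(stripped, stripped[1:]))


gl = [
    {"concave-up(e) at 16 :- bowl(e) at 16",
     "maybe-contains-liquid(e) at 16 :- contains-liquid(e) at 16"},
    {"concave-up(e) at 17 :- bowl(e) at 17"},
]
print(is_consistent(gl))
print(is_consistent(list(reversed(gl))))
```

Dropping the stamps is weaker than a true renaming, since it would also equate clauses whose stamps differ in inconsistent ways; a fuller version would look for one renaming that works for the whole set.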

The constructive induction used for generalisation
is absorption and intra-construction as described in
{Sammut, 1981, Sammut & Banerji, 1986, Muggleton
& Buntine 1988}, except that ideal tests of generalisations
have been modified to generate performable tests
given the current state and execution of the simulated
world. The problem is, how can we find a generalisation
on a concept body of a predicate being tracked
which can be tested given the current execution of the
world? The answer depends on how much the world
and the concept body must match, and this possible
restriction must be taken into account when deciding
what generalisation is best to test. Tests of requested
generalisations are constructed on an existing concept
P ← C in the following manner:

L = repeat

No generalisation is to be tested, G = ∅. Return
a test equal to the concept body, T = {C}.

L = absorb

If we think of C as composed of two parts B1
and B2 and we have a concept H ← B2, then
absorption replaces the existing concept by P ←
H, B1. The generalisation set G is {H ← B2},
and the test of this absorption is T = {T1, B1}
where H ← T1 is another concept.

L = intra-construct

For intra-construction, we have a second existing
concept P ← B2, B3, and the two concepts are
replaced by P ← H, B2 where H is a new predicate.
Also, two auxiliary concepts are invented,
H ← B1 and H ← B3. The generalisation set G
is {H ← [illegible]} and the test is simply the second
concept body, T = {B2, B3}.
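
The two rewriting operators can be sketched on clause bodies treated as sets of literals. This is an illustration under simplifying assumptions, not the operators of the cited work in full: literals are plain strings, no unification or variable renaming is done, and the function names and the example literals are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Optional, Tuple


def absorb(body: Iterable[str], h: str,
           h_body: Iterable[str]) -> Optional[FrozenSet[str]]:
    """P <- B1,B2 with H <- B2 becomes P <- H,B1 (None if B2 is not in the body)."""
    body_set, part = frozenset(body), frozenset(h_body)
    if not part <= body_set:
        return None
    return (body_set - part) | {h}


def intra_construct(body1: Iterable[str], body2: Iterable[str],
                    new_name: str) -> Tuple[FrozenSet[str], List[FrozenSet[str]]]:
    """P <- B1,B2 and P <- B2,B3 become P <- H,B2 with H <- B1 and H <- B3."""
    first, second = frozenset(body1), frozenset(body2)
    shared = first & second                       # plays the role of B2
    new_body = shared | {new_name}                # P <- H, B2
    auxiliaries = [first - shared, second - shared]  # H <- B1 and H <- B3
    return new_body, auxiliaries


new_body, aux = intra_construct({"cup(X)", "fillfromtap(X)"},
                                {"bowl(X)", "fillfromtap(X)"},
                                "concave-up(X)")
print(sorted(new_body), [sorted(a) for a in aux])
print(sorted(absorb({"cup(X)", "fillfromtap(X)"}, "concave-up(X)", {"cup(X)"})))
```

In both cases the rewritten body covers more situations than the original, which is why the text treats each application as a generalisation that must then be tested in the world.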

In summary, two abstract influences guide the learning
process. At a high level, regeneralisation represents
a tendency in the learner to rethink a mistrack of one
more general concept as success of a different more
general concept. The low-level influence is consistency,
recognising justifiably acceptable completed executions
of actions performed in the simulation given
that certain generalisation tests are current.


3  Examples 

Two different sessions are presented here to illustrate
aspects of CAP’s learning behaviour. The first illustrates
the complicated hierarchical nature of the concept
learning on a single seed concept, belayed-climb.
The second displays the autonomous discovery aspect
while learning two different seed concepts, do-a-pour
and do-a-fillfromtap, alternately in the same session.

In the domain of mountaineering, the world simulation
provides primitive properties (person, camp-site,
summit, rope-attached, no-rope-attached), relations
(above, level-above, at-rope-ends, not-at-rope-ends),
and actions (climbup, climbdown and sleep-overnight).
The first example shown to CAP starts in state 1 with
persons a and b attached to the rope and level with
camp site camp0, and camp1 three rope lengths above
camp0. Climber a lead climbs up until reaching the
rope end, and stops. He then belays b up. Immediately
b leads up until the rope end is reached, then he
stops. Next, b belays a up and when a leads through
and reaches the rope end he is level with camp1. Finally
b is belayed up until level with a and the camp
site, camp1. The grounded (prior to variablisation) seed
concept produced from this observation is⁴:

belayed-climb(camp0, b, a, camp1) during 1 / 7 :-
    one-move-climb-from-camp(a, b, camp0) during 1 / 2,
    two-move-climb-from-camp(b, a, camp0) during 2 / 4,
    two-move-climb-to-camp(a, b, camp1) during 4 / 6,
    one-move-climb-to-camp(b, a, camp1) during 6 / 7

The remainder of the hierarchy is not shown here for
lack of space. The second example is the same except
that the two camps happen to be an extra rope
length apart in height. This means that a’s second
climb both starts and finishes not level with a camp⁵.
Because part of the second example shown to CAP exactly
matches (ie. is recognised by) part of the first example,
the grounded seed concept belayed-climb
description is generalised by intra-construction to:

belayed-climb(camp0, b, a, camp1) during 9 / 17 :-
    one-move-climb-from-camp(a, b, camp0) during 9 / 10,
    two-move-climb-from-camp(b, a, camp0) during 10 / 12,
    two-move-or-move-to-camp(camp1, a, b) during 12 / 17

with the new, invented concepts automatically justified
by the examples to produce the following grounded
(prior to variablisation) description:

two-move-or-move-to-camp(camp1, a, b) during 12 / 17 :-
    two-move-climb-to-camp(a, b, camp1) during 12 / #8,
    one-move-climb-to-camp(b, a, camp1) during #8 / 17
two-move-or-move-to-camp(camp1, a, b) during 12 / 17 :-
    two-move(a, b) during 12 / 14,
    two-move-climb-to-camp(b, a, camp1) during 14 / 16,
    one-move-climb-to-camp(a, b, camp1) during 16 / 17

With the camps now yet another rope length apart,
learning commences with the system merely attempting
to perform a repeat experiment from state 18. As
the objects involved in the experiment are unknown,
the arguments to belayed-climb are variables. The
trace of the call to belayed-climb follows:

Tracking 'belayed-climb(CampX, Person1, Person2, CampY) during
18 / EndTime' at a requested generalisation level of 'repeat'

⁴ Variables are shown by leading uppercase characters,
and constants by leading lowercase characters or numerals.
⁵ This is why a camp doesn’t appear in two-move.

To track and execute this, CAP tracks each of the
components of the body of belayed-climb given the
starting state 18:

tracking one-move-climb-from-camp(Person1, Person2, CampX)
during 18 / T1

tracking two-move-climb-from-camp(b, a, camp0) during 19 / T2

After execution of one-move-climb-from-camp some of
the variable terms have been unified with constants:
Person1/b, Person2/a, CampX/camp0, and T1/19, but
CampY/CampY is still variable. Now the invented concept
two-move-or-move-to-camp is tracked by trying the
first disjunct body two-move-climb-to-camp to see if a
can climb straight to the unknown camp CampY:

tracking two-move-or-move-to-camp(CampY, a, b) during 21 / T1
tracking two-move-climb-to-camp(a, b, CampY) during 21 / T2

This fails since the camp is still two rope lengths above
a. CAP then retries tracking the second disjunct body
of two-move-or-move-to-camp as shown below:

tracking two-move-or-move-to-camp(CampY, a, b) during 21 / T1
tracking two-move(a, b) during 21 / T2

tracking two-move-climb-to-camp(b, a, CampY) during 23 / T1
Do: climbup(b) during 24 / 25, failure.

:: Regeneralising - backtrack looking for a successful retry

but finds that after those two climbing moves by a,
and another two moves by b, b is not level with
the unknown camp either. Thus a repeat of the existing
concept will not allow the procedure to complete
execution. By increasing the requested generalisation
level to absorb, the learner can try to find such a
generalisation to enable completed execution of the belay
climbing procedure. An absorptive generalisation is
found on the bodies of two-move-or-move-to-camp that
creates the following test of that absorption:

two-move(a, b) during 21 / T2,
two-move(b, a) during T2 / T3,
two-move-climb-to-camp(a, b, CampY) during T3 / T4,
one-move-climb-to-camp(b, a, CampY) during T4 / T1

This test requires another two-move to be executed
by b, thus matching the actual fact that he was not
at a camp at the end of his second climb⁶. Thus
two-move(b, a) during 23 / 25 exits, and execution
continues with two-move-climb-to-camp and a finally
arriving level with camp1 in state 27:

tracking two-move-climb-to-camp(a, b, CampY) during 25 / T4
tracking one-move-climb-to-camp(b, a, camp1) during 27 / T1


Finally the procedure belayed-climb exits after
having to allow an absorptive generalisation on
two-move-or-move-to-camp in order for the procedure to
complete execution:

Tracked success for 'belayed-climb(camp0,b,a,camp1) during 18/28'
at requested level 'repeat' and actual level 'absorb' taking 28.33 secs


Because of the success, any generalisations are confirmed,
including the absorption and a number of
unplanned low-level intra-constructions on bodies of
two-move and others (not shown here). Two of the
completed grounded (prior to variablisation) new concept
descriptions are shown for brevity (stwo-move is a
recursive sub-concept of two-move):


two-move-or-move-to-camp(camp1, a, b) during 21 / 28 :-
    two-move(a, b) during 21 / 23,
    two-move-or-move-to-camp(camp1, b, a) during 23 / 28
stwo-move(b, a) during 23 / 25 :-
    symmet-at-rope-ends(a, b) at 23,
    rope-attached(b) at 23,
    rope-attached(a) at 23,
    person(b) at 23,
    person(a) at 23,
    above(a, b) at 23,
    climbup(b) during 23 / 24,
    stwo-move(b, a) during 24 / 25

and the two grounded invented concepts:

symmet-at-rope-ends(a, b) at 23 :- at-rope-ends(b, a) at 23
symmet-at-rope-ends(a, b) at 23 :- at-rope-ends(a, b) at 23
symmet-not-at-rope-ends(a, b) at 24 :- not-at-rope-ends(b, a) at 24
symmet-not-at-rope-ends(a, b) at 24 :- not-at-rope-ends(a, b) at 24

Thus by observing two examples of a belayed climb
from one camp to another, and then being faced with a
situation in which exact imitation would fail, the
learner has acquired the ability to climb any two persons
from one camp to another camp in safety.


The second example involves concepts of water
transfer and containment. Here, the world simulation
provides primitive properties (cup, bowl, cylinder,
contains-liquid and not-contains-liquid) and primitive
actions (fillfromtap and pour). The initial state
consists of four objects: two cups, a closed cylinder,
and a bowl. The first cup contains water.

The first seed example is the single action pour(a,
b) during 1/2, pouring the contents of cup a to cup
b. The initial seed concept hierarchy from this observation
is given in Example 1.

Figure 2: World simulation in state 2

The second seed concept example do-a-fillfromtap
consists of the one action fillfromtap(a) during 2/3,
filling cup a from an invisible tap.

Figure 3: World simulation in state 3

The second grounded seed concept hierarchy is given by:


Sorted count list [(a , 3)]
Trimming to objects: [a]. Each state trimmed from 8 to 2
do-a-fillfromtap(a) during 2 / 3 :-
    sdo-a-fillfromtap(a) during 2 / 3
sdo-a-fillfromtap(a) during 2 / 3 :-
    ssdo-a-fillfromtap(a) during 2 / 3
ssdo-a-fillfromtap(a) during 2 / 3 :-
    not-contains-liquid(a) at 2,
    cup(a) at 2,
    fillfromtap(a) during 2 / 3,
    contains-liquid(a) at 3,
    cup(a) at 3


Now that these two seed concept hierarchies have been
created, the program commences the learning phase.
Randomly, an attempt is made to prove do-a-pour in the
world with an initial requested generalisation level of
repeat (ie. no generalisation attempted) and the starting
state term bound to the current state number
3. Thus, the body of ssdo-a-pour has a substitution
are variablised and stored even though the generalisation
for which they were constructed failed. This
is to prevent the same intra-construction from being
repeated in the future.

The failure means that this seed concept’s generalisation
level is reset to repeat to rebuild confidence in
the current description. As this level is less than that
of do-a-pour, a repeat experiment fillfromtap(b)
during 11/12 is attempted on do-a-fillfromtap. This
completes successfully, increasing the generalisation
level to absorb. As this is still lower than the level
of do-a-pour, the experimentation continues with
the same seed concept.

Although an absorption can be tested, it is
equal to the only member of the failed generalisations
of ssdo-a-fillfromtap. A repeat experiment
fillfromtap(b) during 12/13 is thus conducted. The
successful completion would increase the generalisation
level to intra-construct, but as this level is not
allowed do-a-fillfromtap is marked as stopped.
Experimentation continues with do-a-pour requesting an
intra-construction.

An existing concept cuporcylinder means that an
absorption would be planned if the cylinder was chosen
as the destination. Thus the bowl is chosen for
the planned intra-construction. Pouring from cup to
bowl is performed: pour(b,e) during 13/14.

Figure 8: World simulation in state 14

The resulting state does not match the test, but
regeneralisation finds a consistent generalisation on the body
equal to the restamped initiating generalisation. The
new grounded intra-constructed concepts are:

concave-up(e) at 13 :- [cup(e) at 13]
concave-up(e) at 13 :- [bowl(e) at 13]


The success means that this seed concept’s generalisation
level is reset to repeat, and any “stopped
testing” concepts are unmarked, because a new concept
was invented which may mean that other generalisations
may now be testable. As this level is less
than that of do-a-fillfromtap, a repeat experiment
is now attempted on do-a-pour. The system
selects the first two cups for the repeat experiment
pour(b, a) during 14/15. The fact that they happen
to be both empty satisfies the concept.

Figure 9: World simulation in state 15

After pouring, there is a test predicate not satisfied by the
world, contains-liquid(a) at 15, and a world predicate
not covered by the test, not-contains-liquid(a)
at 15. Thus the experiment must fail. As no generalisation
was tested we know that some subset of
the current generalisations is incorrect. As we do not
know which one, the conjunction of all the current
generalisations on the body is recorded as a failed
generalisation. We also undo those generalisations, thus
unwinding the body back to the originally observed
example. The unsuccessful completion of this experiment
increases the generalisation level to absorb in the
hope that a more general test may possibly succeed.

Although the other seed concept do-a-fillfromtap
is at a level of absorb, experimentation remains with
do-a-pour due to a delay in restarting stopped concepts.
In this situation, two simultaneous absorptions
are found on different parts of the restricted body:
one testing a change in the source object from the
cup to a bowl and the other testing a change in the
destination object from a cup to a cylinder.

Figure 10: World simulation in state 16

After pour(e,c) during 15/16 there is a test predicate that
is not satisfied by the world, contains-liquid(c) at
16, and a world predicate not covered by the test,
not-contains-liquid(c) at 16. Thus the experiment
must fail, with the generalisation marked as a failed
one. The generalisation level does not increase since
some knowledge (a failed generalisation) was learned.
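
The comparison between an experiment's test and the resulting world state, as used in the two failures just described, amounts to two set differences. The sketch below assumes both are sets of ground predicate strings already trimmed to the relevant objects; the function name `compare` is hypothetical.

```python
from typing import Iterable, List, Tuple


def compare(test: Iterable[str], world: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (test predicates the world does not satisfy,
               world predicates the test does not cover)."""
    test_set, world_set = set(test), set(world)
    unsatisfied = sorted(test_set - world_set)
    uncovered = sorted(world_set - test_set)
    return unsatisfied, uncovered


# State 16 after pour(e,c): the closed cylinder c did not receive any liquid.
test = {"cylinder(c) at 16", "contains-liquid(c) at 16"}
world = {"cylinder(c) at 16", "not-contains-liquid(c) at 16"}
unsatisfied, uncovered = compare(test, world)
print(unsatisfied)
print(uncovered)
```

A non-empty result does not by itself decide the outcome: as the session shows, a mismatch leads either to failure or to a regeneralisation that finds a consistent generalisation covering the difference.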

Experimentation alternates to do-a-fillfromtap at
the same level of absorb. The only predicate in the
starting state of ssdo-a-fillfromtap that can have
an absorption constructed on it is cup. As there
is a current generalisation maybe-contains-liquid on
that restricted body and the combination of this
and cylinder from cuporcylinder is marked as a
failed generalisation, we try the combination of
maybe-contains-liquid and bowl from concave-up as a
planned absorption. Here, a following generalisation is
required which is consistent with the initiating generalisation,
and therefore this is a successful experiment,
and the generalisation level is unchanged.

Figure 11: World simulation in state 17

The new grounded concept description is:


ssdo-a-fillfromtap(e) during 16 / 17 :-
    concave-up(e) at 16,
    maybe-contains-liquid(e) at 16,
    fillfromtap(e) during 16 / 17,
    concave-up(e) at 17,
    contains-liquid(e) at 17
ElabJustifications:
    initiating(maybe-contains-liquid(e) at 16,
        [not-contains-liquid(e) at 16], [contains-liquid(e) at 16]),
    initiating(concave-up(e) at 16, [cup(e) at 16], [bowl(e) at 16]),
    following(concave-up(e) at 17, [cup(e) at 17], [bowl(e) at 17])
FailedGens:
    [maybe-contains-liquid(e) at 16 :- [contains-liquid(e) at 16],
     cuporcylinder(e) at 16 :- [cylinder(e) at 16]]
searched for that requires no generalisation on that
body. Unfortunately this is not possible as both the
cups are full of water (the observed example had one
cup full and one empty). Thus a minimal unplanned
generalisation is found by intra-construction covering
the unmatched predicate not-contains-liquid(b)
at 3. This represents a test of whether or not the
destination cup need be empty at the start of the
experiment. Thus the primitive action pour(a,b) during
3/TEnd is performed in the simulation. The succeeding
state (Figure 4) is as predicted by the experiment and
thus the generalisation is adopted.

Figure 4: World simulation in state 4

The new grounded intra-constructed concepts are:

maybe-contains-liquid(b) at 3 :- [not-contains-liquid(b) at 3]
maybe-contains-liquid(b) at 3 :- [contains-liquid(b) at 3]


The generalisation level for do-a-pour does not increase
as learning occurred from the experiment.

The other seed concept do-a-fillfromtap is at the
same generalisation level and a repeat experiment is
performed on it as there is an unfilled cup. The generalisation
level for do-a-fillfromtap increases to absorb
as nothing was learned. This increase corresponds to
confidence in the existing seed concept hierarchy stimulating
experiments for a more general description.

Do-a-pour is attempted next as it is at the lower
generalisation level repeat. This experiment, pour(b, a)
during 5/6, is successful and the generalisation level
increases as nothing was learned. Since both seed concepts
are at the next planned generalisation level of
absorb, either is selected for the next complete experiment.
With do-a-pour, CAP explicitly tests whether
or not the object from which the water comes must
contain liquid at the start of the action. Thus it pours
an empty cup b into a full cup a, pour(b, a) during
6/7. Again the succeeding state needs no generalisation
in order to be proved and thus the generalisation
is confirmed. As learning occurred, the generalisation
level is unchanged.

Although this generalisation was correctly tested, the resulting description is
not in the intended meaning of the trainer's example of do_a_pour. Pouring an
empty cup into an empty cup would not have succeeded, as a different
generalisation would be required to track completion, resulting in an
inconsistency.

But now, because both seed concepts are at the same generalisation level, an
explicit absorption is attempted on the concept hierarchy for do_a_fillfromtap.


With the body of ssdo_a_fillfromtap, a substitution is searched for that
requires an absorptive generalisation. This is possible as cup a is full of
water. This generalisation means that it doesn't matter whether or not the cup
you are filling already contains water; afterwards it is guaranteed to contain
water. The experiment fillfromtap(a) during 7/8 is a success, the generalisation
is adopted, and the generalisation level is unchanged.

Figure 6: World simulation in state 8

Because both seed concepts are still at the same generalisation level, they are
attempted alternately. Thus, do_a_pour is now tracked for a requested
absorption. As no more generalisations are possible on the starting state^ of
ssdo_a_pour's body, a repeat experiment is performed. Since nothing was learned
during this successful experiment, pour(b,a) during 8/9, the generalisation
level for do_a_pour increases. But planned intra-construction is not allowed,
thus testing on this seed concept hierarchy is marked as stopped. Similarly, the
other seed concept do_a_fillfromtap is marked as stopped, after a repeat
experiment fillfromtap(b) during 9/10.

Although planned intra-construction is not allowed, when learning is exhausted
in the world the learner may conduct a once-off, explicitly planned invention on
each of the seed concepts. Learning temporarily recommences at this level for
both seed concepts, with do_a_fillfromtap arbitrarily chosen first.

Here, ssdo_a_fillfromtap has a choice of substitutions that require an
intra-constructive generalisation. This is possible with the test having either
a cylinder or a bowl replacing the cup being filled. The cylinder c is selected.
As we can see in Figure 7, there is no water in the cylinder after the action
fillfromtap(c) during 10/11. Although intra-construction could allow the
remainder of ssdo_a_fillfromtap to track, the extra construction would cause an
inconsistent generalisation. This failed generalisation test actually occurs in
conjunction with the previous generalisation maybe_contains_liquid(c) at 10 ←
contains_liquid(c) at 10, and it is the conjunction of this with
cuporcylinder(c) at 10 ← cylinder(c) at 10 that is recorded as having failed.

Figure 7: World simulation in state 11

The new grounded intra-constructed concepts are:

cuporcylinder(c) at 10 ← [cup(c) at 10]
cuporcylinder(c) at 10 ← [cylinder(c) at 10]

^Actually defined as a restriction in (Hume, 1990).


120  Hume 


Informally, this final resulting concept description for do_a_fillfromtap means
that given a concave upwards object, whether full of water or not, the result of
filling it from a tap is that it becomes full of water.

ElabJustifications is a recording of the justification (replacing predicate, the
replaced predicates, and the test of the replacing predicate) for each
generalisation on the body. FailedGens represents the set of conjunctions of
generalisations on the original concept body that have been tested as being
outside the correct description for this concept.

Learning switches to do_a_pour with absorb. In this situation, two simultaneous
absorptions are found on different parts of the restricted body: one testing a
change from a cup to a cylinder and the other testing whether the source object
need contain water at the start. The action performed is pour(b,c) during 17/18.
After pouring, there is a test predicate not satisfied by the world,
contains_liquid(c) at 18, and a world predicate not covered by the test,
not_contains_liquid(c) at 18. This means that the experiment must fail. From an
observer's point of view it would have failed due to either of these
elaborations alone, an empty source or an unfillable destination. We mark this
as another failed generalisation. The generalisation level does not increase
since some knowledge (a failed generalisation) was learned.

Figure 12: World simulation in state 18

Experimentation alternates to do_a_fillfromtap at the same level of absorb. But,
after a repeat experiment fillfromtap(a) during 18/19, it is marked as stopped
as the generalisation level cannot increase.

Experimentation continues with do_a_pour. Two simultaneous absorptions
(different to any of the failed generalisations) are found on different parts of
the restricted body: one testing a change from the destination cup to a bowl and
the other testing whether the destination object need contain water at the
start. The action performed is pour(a,e) during 19/20. The resulting situation
requires a following generalisation which is equal to the initiating
generalisation except for the substitution. This satisfies consistency and thus
is a successful experiment. The generalisation level remains unchanged and, as
the other seed concept is stopped, we continue with do_a_pour.

Figure 13: World simulation in state 20

Here, an absorption is found on the restricted body testing a change from the
source object being a cup to being a bowl. The action performed in this case is
pour(e,a) during 20/21. Again, a following generalisation is required which is
consistent and therefore this is a successful experiment.

Figure 14: World simulation in state 21

The new grounded concept description is:

ssdo_a_pour(a, e) during 20/21 ←
    concave_up(e) at 20,
    concave_up(a) at 20,
    maybe_contains_liquid(a) at 20,
    contains_liquid(e) at 20,
    pour(e, a) during 20/21,
    concave_up(e) at 21,
    concave_up(a) at 21,
    contains_liquid(a) at 21,
    not_contains_liquid(e) at 21
ElabJustifications:
    initiating(concave_up(a) at 20, [cup(a) at 20], [bowl(a) at 20]),
    initiating(maybe_contains_liquid(a) at 20,
               [not_contains_liquid(a) at 20], [contains_liquid(a) at 20]),
    following(concave_up(a) at 21, [cup(a) at 21], [bowl(a) at 21]),
    initiating(concave_up(e) at 20, [cup(e) at 20], [bowl(e) at 20]),
    following(concave_up(e) at 21, [cup(e) at 21], [bowl(e) at 21])
FailedGens:
    [cuporcylinder(a) at 20 ← [cylinder(a) at 20],
     maybe_contains_liquid(e) at 20 ← [not_contains_liquid(e) at 20]],
    [cuporcylinder(a) at 20 ← [cylinder(a) at 20],
     concave_up(e) at 20 ← [bowl(e) at 20]],
    [maybe_contains_liquid(a) at 20 ← [contains_liquid(a) at 20],
     maybe_contains_liquid(e) at 20 ← [not_contains_liquid(e) at 20],
     concave_up(a) at 20 ← [bowl(a) at 20]]

Informally, this final resulting concept description for do_a_pour means that
given a source concave upwards object containing water and a destination concave
upwards object maybe containing water, the result of pouring is that the
destination is full of water and the source is empty.

As no further generalisations are possible on the concept, a repeat experiment,
pour(a,b) during 21/22, is performed. As the generalisation level cannot be
increased, testing is stopped.

4  Discussion 

Other concepts which CAP has learned, displaying different aspects of its
behaviour, include:

Climbing to summit and back with overnight camps (uses belay_climb).

Building an arch of any height (invented level_or_above, symmetric not_touching
and level_above, and two new movement disjuncts).

In  these  examples  (except  water  pouring)  CAP 
could  only  learn  the  most  useful  concept  if  there  was  a 
likely  repeated  pattern  to  the  actions  in  the  intended 
concept  description. 

In water pouring, a single action constitutes the concept and the system
discovers and invents concepts to generalise preconditions for the action from
just one observed example. Thus for single action procedures, CAP and LIVE
(Shen & Simon, 1989) perform and discover equivalently (but with different
representations) given no domain knowledge. This is because in LIVE one can
define an empty goal, causing the system to merely predict the effect of an
action given any state.

The mountaineering example of climbing to the summit with overnight stops and
returning illustrates the system's ability to use concepts that it has
previously learned in order to describe new concepts. Here, belay_climb was
successfully recognised and used in moving between camps.

The arch-building example displayed one problem symptomatic of the system's
tendency to generalise state predicates in order to execute actions. In this
case, learning an arch from blocks and a cross beam required aligning the two
base blocks just inside and in front of the ends of the beam, then building the
columns up equally and putting the beam on top of the columns. If, during
experimentation, only a short beam is available that causes the columns of
blocks to touch, then a concept maybe_touch(a,b) at 10 will be invented to cover
the difference between the existing predicate not_touching(a,b) at 10 and the
current predicate touching(a,b) at 10. When the generalisation causing this is
tested, the rest of the experiment succeeds, thus incorrectly accepting it as a
valid generalisation. This particular problem would be remedied by actually
using the arch for the purpose for which it was constructed, i.e. by passing
another object through the constructed arch. But as this would have to appear as
a part of the example, every use of a concept for constructing objects would
need to appear in the examples observed by the system. This is clearly deficient
but indicative of the lack of suitability of CAP for goal oriented concepts.

Attempts were also made to learn problem solving tasks such as the Tower of
Hanoi puzzle, genetic inheritance, and lens grinding, but without success. This
failure was due to the lack of goal direction in CAP. It simply could not
recognise that aspects of the final state were vitally important in learning the
required concepts. Some aspects of these concepts were learned where the general
form of the action ordering in solution paths had common and/or repeated
patterns (most success with Tower of Hanoi). But, in the worst case, CAP would
have to be shown every solution path.

Another problem is the sensitivity to extraneous details during the observed
sequence dissection. If a block is moved left until touching a target block, but
it passes a distant block on the way, then three relational predicates change
with the distant block (rightof, level_rightof, righiof) but only two with the
intended target (not_touching, touching). In this case a distant object must
always be passed on the way to a target block. Thus the "localised focussing"
heuristic biases the system to action sequences where the intended relevant
objects are those in the most frequently changing predicates.

The implementation consists of about 3200 lines of UNSW Prolog 4.2 excluding
comments, the simulations, and the interface thereto. Times for a single
experiment (at about 10 kLIPS) ranged from 0.5 sec for water pouring to 15
minutes for arch building (due to 120 predicates per state × 55 states to build
it).


5  Conclusion 

Given only a few examples and no background knowledge or domain theory, it is
possible to learn effectively in a wide range of interactive environments. A
representation and mechanism using first-order logic is largely responsible for
this, but at the cost of having to control unconstrained behaviour. In most
systems with this breadth of application, the solution to this results in either
a large amount of domain and control knowledge (e.g. EBL, problem solving
learners) or a large amount of explicit oracle interaction (e.g. Marvin, CIGOL).

Without  domain  knowledge  or  control  and  without 
oracle  interaction,  effective  learning  has  been  demon¬ 
strated  that  uses  one  very  abstract  control  heuristic, 
“imitate  activity  you  see  in  the  environment”.  The 
categorisation  of  positive  and  negative  instances  of 
concepts  is  done  by  the  environment,  and  the  “drive” 
to  learn  comes  from  a  continual  search  to  try  and 
match  patterns  by  performing  them.  Thus,  recognis- 
ably  successful  executions  of  concepts  that  partially 
match  existing  concepts  result  in  more  general  con¬ 
cepts  being  adopted. 

Additionally, learnt concepts can be used to assist the learning of later
concepts since they become part of the description language. Also, clusters of
properties, relations and actions are discovered, grouped by their facilitation
of valid procedure executions.

Abstract

A framework for induction has been proposed in (Muggleton & Buntine, 1988) and
implemented in the CIGOL system. We have extended the operators introduced in
CIGOL to non unit clauses. Doing this, we have discovered two limitations of
inversion of resolution, mainly one about the Absorption operator. We therefore
propose a new operator called Saturation that replaces Absorption in our system.
We give a clear definition of this operator after a formal analysis of the
problem. We then describe how the two other operators, namely Intraconstruction
and Truncation, can be reformulated and re-implemented.^


1  Introduction 

1.1  Motivations 

Induction is the process of building a general theory from particular facts.
This definition has led to many interpretations in the Machine Learning
community, depending on the meaning ascribed to generality. We will not discuss
this issue here, and refer to (Plotkin, 1971; Mitchell, 1982; Kodratoff &
Ganascia, 1986; Buntine, 1988; Helft, 1988; Kodratoff, 1988; Niblett, 1988;
Kodratoff, 1989) for a review of these different definitions.

A very general framework states that A is more general than B if A logically
implies B (denoted A ⊨ B). This basically means that the models of A are also
models of B; by that, no restriction is made about any inference procedure we
may use.

(Muggleton & Buntine, 1988) specializes the previous expression, stating that A
is more general than B if B can be derived from A using SLD resolution (denoted
A ⊢SLD B). The process of induction can be described in their framework as
follows. Given a domain theory T and an example E, such that T ⊬SLD E (E is not
derivable from T using SLD resolution), their aim is to generate a new domain
theory T' such that T' ⊢SLD E and T' ⊢SLD T. T' can then be induced from T and E
by the process of 'inversion of resolution'. Their system, called CIGOL, uses
three operators: Absorption, which inverts one single step of resolution;
Intraconstruction, which inverts two resolution steps of two clauses with one
common clause; and Truncation, which inverts substitutions.

^The first author has a French MRT scholarship and is partially supported by
CEC, through the ESPRIT-2 contract MLT 2154. The second author is supported by
PRC-IA.

Finding a general solution to inversion of resolution raises some problems, both
from a theoretical and from a practical point of view. Some of these problems
are listed in section 2. The first motivation of our work was to find a simple
and efficient solution for inversion of resolution when using a first order
logic language without function symbols. In addition, we have used an automatic
representation change that transforms functions into predicates (by creating one
predicate that has one additional argument representing the result of the
function) and vice versa, thus enabling us to deal with arbitrary Horn clauses.
The conjunction of these two results allows us to extend the three operators
introduced in CIGOL to non unit clauses. This work has been implemented in
Quintus PROLOG in a system called IRES (Rouveirol & Puget, 1989) and is briefly
described in section 2.

With this implementation, we became aware of problems which are not related to
our particular solution but rather are fundamental limitations of inversion of
resolution. We review these problems in section 3. We also introduce a solution
that relates inversion of resolution to results obtained in Logic Programming.
We therefore suggest a re-thinking of inversion of resolution operators in
section 4 and demonstrate the expected improvements on a small example in
section 5. Finally, we conclude on promising research directions in section 6.

1.2  Notations  and  definitions 

We use a subset of first order logic, namely Horn Clauses (as in Prolog), as
representation language. We recall here the basic definitions used in Logic
Programming, and refer to (Lloyd, 1987) and (Genesereth & Nilsson, 1987) for an
extended treatment of this subject.

A predicate is given by a predicate symbol and an arity. An atom is a predicate
applied to terms, i.e., an expression of the following form: p(t1, ..., tm),
where p is a predicate symbol and the ti are terms. A positive literal is an
atom; a negative literal is the negation of an atom.

Briefly, a substitution is a finite set of pairs xi/ti, where xi is a variable
and ti a term. A substitution σ is applied to a formula F by replacing each
occurrence of the variables by the corresponding term; the result is noted Fσ.
If {(si, ti)} is a set of pairs of terms or a pair of atoms, a unifier is a
substitution σ such that for every i, siσ = tiσ. A most general unifier (mgu) is
a unifier σ such that for every unifier θ, there exists a substitution γ such
that θ = σγ.
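
As an illustration of these definitions, the following Python sketch computes an
mgu of two terms or atoms. It is not taken from IRES or CIGOL; the encoding is an
assumption made here: variables are strings starting with an uppercase letter (as
in Prolog), constants are lowercase strings, and compound terms and atoms are
tuples (symbol, arg1, ..., argn). The names is_var, walk, occurs, unify and
apply_subst are hypothetical helpers.

```python
def is_var(t):
    # Assumed convention: a variable is a string starting with an uppercase letter.
    return isinstance(t, str) and t[:1].isupper()

def walk(t, s):
    # Follow the bindings of substitution s until t is unbound or not a variable.
    while is_var(t) and t in s:
        t = s[t]
    return t

def occurs(v, t, s):
    # Occurs check: does variable v appear inside term t (under s)?
    t = walk(t, s)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, s) for a in t[1:])

def unify(a, b, s=None):
    """Return an mgu of a and b as a dict of bindings, or None if none exists."""
    s = {} if s is None else s
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return None if occurs(a, b, s) else {**s, a: b}
    if is_var(b):
        return None if occurs(b, a, s) else {**s, b: a}
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] and len(a) == len(b)):
        for x, y in zip(a[1:], b[1:]):
            s = unify(x, y, s)
            if s is None:
                return None
        return s
    return None  # two different constants or function symbols

def apply_subst(t, s):
    # Compute t·s, resolving chains of bindings.
    t = walk(t, s)
    if isinstance(t, tuple):
        return (t[0],) + tuple(apply_subst(x, s) for x in t[1:])
    return t

atom1 = ("member", "X", ("cons", "blue", "nil"))
atom2 = ("member", "blue", "Z")
sigma = unify(atom1, atom2)
```

Here sigma binds X to blue and Z to cons(blue, nil), so that applying it to
either atom gives the same instance.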

A clause is a finite disjunction of literals with all its variables universally
quantified. A Horn clause has at most one positive literal. A goal clause is a
clause with no positive literal and is denoted ← B1 ∧ ... ∧ Bm. A definite
clause is a clause with exactly one positive literal and is noted
P ← B1 ∧ ... ∧ Bm. P is the head of the clause, and the conjunction
B1 ∧ ... ∧ Bm is the body of the clause. A unit clause has an empty body. By
logic program, we mean a set of definite clauses, i.e., it does not contain any
goal clause.

P entails F is noted P ⊨ F, and means that the formula F ← P is valid.

In this framework, a concept is represented by a predicate. A clause is
interpreted as a concept definition; its head identifies the defined concept,
its body lists the conditions for belonging to the concept.

In the remainder of the paper, two definitions of generality will be used. The
first one is used in CIGOL and the first implementation of IRES: a theory T is
more general than a formula F if and only if F can be proved from T, that is
T ⊢SLD F. In such a case formula F is said to be covered, or explained, by
theory T: explanation means proof. The second one is used in a later version of
IRES: a theory T is more general than a formula F if and only if T entails F,
that is T ⊨ F. In such a case T also explains F.

1.3  Resolution 

We consider in the following a special case of resolution, the one used in
PROLOG, called SLD resolution. It is defined as follows. Let us consider two
clauses

(C1): H1 ← A ∧ α

(C2): H2 ← β

where α and β are conjunctions (possibly empty) of atoms. A is an arbitrary
literal of the body of (C1), and α is the remainder of the body of (C1). We say
that (C1) can be resolved with (C2) if H2 and A unify with mgu θ. We then have
Aθ = H2θ. The result of the resolution step is the new clause:

Res(C1,C2): H1θ ← βθ ∧ αθ

By such a definition of resolution, we impose that the resolution step unifies
the head of (C2) with a literal within the body of (C1).

Note that (C1) can also be a goal clause, i.e. the clause without head

(C1): ← A ∧ α

In such a case the result of the resolution step is

Res(C1,C2): ← βθ ∧ αθ
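
The resolution step defined above can be sketched in Python for the function-free
case. This is only an illustration under assumptions made here: a clause is a
pair (head, body), atoms are tuples of strings, variables start with an uppercase
letter, a goal clause has head None, and the two clauses are assumed to share no
variable names. The function names are hypothetical.

```python
def is_var(t):
    # Assumed convention: variables start with an uppercase letter.
    return t[:1].isupper()

def walk(t, s):
    # Follow bindings until an unbound variable or a constant is reached.
    while is_var(t) and t in s:
        t = s[t]
    return t

def unify_atoms(a, b):
    """mgu of two function-free atoms, or None if they do not unify."""
    if a[0] != b[0] or len(a) != len(b):
        return None
    s = {}
    for x, y in zip(a[1:], b[1:]):
        x, y = walk(x, s), walk(y, s)
        if x == y:
            continue
        if is_var(x):
            s[x] = y
        elif is_var(y):
            s[y] = x
        else:
            return None  # two distinct constants
    return s

def subst(atom, s):
    # Apply substitution s to every argument of the atom.
    return (atom[0],) + tuple(walk(t, s) for t in atom[1:])

def resolve(c1, c2, i):
    """Resolve the i-th body literal A of c1 with the head H2 of c2.

    Returns H1θ ← βθ ∧ αθ, or None when A and H2 do not unify.
    """
    h1, body1 = c1
    h2, body2 = c2
    theta = unify_atoms(body1[i], h2)
    if theta is None:
        return None
    alpha = body1[:i] + body1[i + 1:]          # remainder of the body of c1
    new_head = None if h1 is None else subst(h1, theta)
    return new_head, [subst(a, theta) for a in body2 + alpha]

c1 = (("grandfather", "X", "Z"), [("father", "X", "Y"), ("father", "Y", "Z")])
c2 = (("father", "A", "B"), [("child_of", "B", "A"), ("sex", "A", "male")])
res = resolve(c1, c2, 0)
goal = resolve((None, [("father", "tom", "W")]), c2, 0)
```

With these inputs, res is the clause grandfather(A,Z) ← child_of(B,A) ∧
sex(A,male) ∧ father(B,Z), and goal is the headless clause ← child_of(B,tom) ∧
sex(tom,male).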

2  Inversion  of  resolution 

2.1  CIGOL 

In order to give a hint of the difficulties faced in solving inversion of
resolution in the general case, we give, after (Muggleton & Buntine, 1988), the
formula used for inverting a single step of resolution in the Absorption
operator. If C1 and R are clauses, all clauses C2 such that R is the resolvent
of C1 and C2 can be computed with the following:

C2 = (R − (C1 − {L1})•θ1)•θ2⁻¹ ∪ {L2}    (1)

where L1 and L2 are the resolved literals in C1 and C2, θ1 and θ2 are the
associated substitutions (L1•θ1 = L2•θ2), and θ2⁻¹ is the 'inverse substitution'
of θ2. An inverse substitution basically replaces constants or terms by
variables. It can be viewed as an extension of the turning constants into
variables generalization rule (Michalski, 1983).

CIGOL only provides a partial solution for this equation, that is, it solves
this equation in the case where C1 is a unit clause (in other terms, C1 − {L1}
is empty). Equation (1) then simplifies to

C2 = (R•θ1)•θ2⁻¹ ∪ {L2}

where the unknowns are θ2⁻¹, θ1 and L2. The main difficulty is to build θ2⁻¹.
During the resolution step θ2 substitutes one single variable in C2 with an
arbitrary term which is also found in (R•θ1); conversely, for building θ2⁻¹,
CIGOL has to choose which subterms in R to replace by one common variable in C2.
There can be a combinatorial explosion at this step. In CIGOL, the choice of
this 'inverse substitution' is guided by optimization of information compression
when replacing C2 by R (see (Muggleton, 1988) for details about CIGOL strategy).

2.2  IRES 

2.2.1  Working  hypothesis 

We have chosen a different simplifying hypothesis to solve inversion of
resolution. We have kept in a first step the same structure as CIGOL (that is,
with three operators Absorption, Intraconstruction and Truncation) but we only
deal with clauses without function symbols. This greatly simplifies the
descriptions and implementations of the operators. In order to overcome the
strong limitation introduced by our working hypothesis, we use an automatic
representation change that transforms functions into predicates (flattening)
and vice versa, without changing the semantics of the overall program. Let us
briefly describe this representation change with a small example. The clause to
flatten is:

(Ex): member(blue,[blue]) ←

The second argument of the predicate member, [blue], is in fact the term
cons(blue,nil). The idea of flattening is to replace the function cons(X,Y) by a
new predicate consp(X,Y,Z), where Z is the result of applying cons to X and Y.
In our example, this replacement yields the new clause:

(Ex'): member(blue,Z) ← consp(blue,nil,Z)

The new predicate consp is defined as follows to preserve the semantics of the
original clause:

(CONS): consp(X, Y, cons(X,Y)) ←.

The meaning of the input clause (Ex) has not been changed since the resolution
of (Ex') with (CONS) gives (Ex), as can be checked. This is a general property
of the flattening transformation. The flattening algorithm performs these
transformations for all the occurrences of functions in the clauses, in
particular for the constants, since these are functions of arity 0. For
instance, blue is replaced by a predicate bluep(A), etc. The final result of
flattening is thus the following clause:

(Ex''): member(X,Z) ← consp(X,Y,Z) ∧ bluep(X) ∧ nilp(Y).

together with the clauses that define the predicates consp, bluep and nilp:

(CONS): consp(X, Y, cons(X,Y)) ←.

(BLUE): bluep(blue) ←.

(NIL): nilp(nil) ←.

This kind of representation transformation is classical in Logic Programming and
has been done manually in the system MARVIN (Sammut, 1981). Our originality is
to have automated it.
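
The flattening of a clause can be sketched as follows. This is a minimal Python
illustration, not the IRES algorithm itself: the tuple encoding of terms, the
function name flatten_clause, the fresh variable names V1, V2, ... (assumed not
to clash with variables already in the clause) and the systematic "p" suffix
(after consp, bluep and nilp above) are all assumptions made here.

```python
from itertools import count

def flatten_clause(head, body):
    """Flatten a clause into a function-free clause.

    Terms are variables (strings starting with an uppercase letter),
    constants (lowercase strings) or tuples (functor, arg1, ..., argn).
    Each constant or compound term is replaced by a variable, and a literal
    functor+"p"(args..., variable) is added to the body.
    """
    fresh = count(1)
    var_of = {}   # subterm -> variable; identical subterms share one variable
    extra = []    # literals introduced by flattening

    def flat_term(t):
        if isinstance(t, str) and t[:1].isupper():
            return t                          # already a variable
        if t in var_of:
            return var_of[t]                  # subterm seen before
        functor, args = (t, ()) if isinstance(t, str) else (t[0], t[1:])
        flat_args = tuple(flat_term(a) for a in args)
        v = f"V{next(fresh)}"
        var_of[t] = v
        extra.append((functor + "p",) + flat_args + (v,))
        return v

    def flat_atom(a):
        return (a[0],) + tuple(flat_term(t) for t in a[1:])

    new_head = flat_atom(head)
    new_body = [flat_atom(a) for a in body]
    return new_head, new_body + extra

# (Ex): member(blue, cons(blue, nil)) <-
ex_head, ex_body = flatten_clause(("member", "blue", ("cons", "blue", "nil")), [])
```

Up to variable renaming, the result is the clause (Ex''): member(V1,V3) ←
bluep(V1) ∧ nilp(V2) ∧ consp(V1,V2,V3). Both occurrences of blue are mapped to
the single variable V1 because the table var_of is shared across the clause.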

It is worth noticing that two identical subterms are replaced by a single
variable. Here, the blue constant, which appeared twice, is replaced by the
single variable X. This choice greatly simplifies the search for inverse
substitution in Absorption (Rouveirol & Puget, 1989). By doing this, we stick
very closely to the example, and we are of course more sensitive to coincidental
occurrences of one value in the example. We can get rid of these coincidences by
further generalizing the examples when applying Intraconstruction.

The flattening transformation enables us to handle arbitrary Horn clauses as
input and output for the operators. In practice, it allows us to relax the unit
clause assumption put on the input clauses of Absorption, Intraconstruction and
Truncation in CIGOL.


2.2.2  A  sample  session  of  IRES 

Let us illustrate how IRES proceeds to complete the following theory defining
family relationships (CIGOL cannot handle this example because of the unit
clause restriction).

T1: grandfather(X,Z) ← father(X,Y) ∧ father(Y,Z).

T2: father(X,Y) ← child_of(Y,X) ∧ sex(X,male).

T3: mother(X,Y) ← child_of(Y,X) ∧ sex(X,female).

T2 and T3 are flattened into:

T2f: father(X,Y) ← child_of(Y,X) ∧ sex(X,S) ∧ male(S).

T3f: mother(X,Y) ← child_of(Y,X) ∧ sex(X,S) ∧ female(S).

Suppose now that the following example is met.

E: grandfather(tom,liz) ←
       father(tom,helen) ∧ child_of(liz,helen) ∧ sex(helen,female).

which yields, once flattened:

Ef: grandfather(X',Z') ←
       father(X',Y') ∧ child_of(Z',Y') ∧ sex(Y',S') ∧ tom(X') ∧ liz(Z') ∧
       helen(Y') ∧ female(S').

E  is  not  entailed  by  the  available  theory,  because 
father(helen,liz)  cannot  be  derived,  by  resolution  with  one 
of  the  clauses  of  the  domain  theory,  from 
child_of(liz,helen),  sex(helen,S)  and  female(S) ,  but 
mother(helen,liz)  can  be  derived  :  the  definition  of 
grandfather  is  clearly  incomplete. 

In the remainder of the example, IRES proposes to the user all the operators
that can be applied at one step, and the user chooses which operator to apply.
At this step, all three operators are applicable; the user chooses Absorption,
as it generalizes the expression of the example.

2.2.2.1 Absorption

Absorption rewrites one clause C2 (be it an example or a rule of the domain
theory) using a concept defined in another clause C1. Absorption of a clause C1
(the absorbed clause) by a clause C2 (the absorbing clause) is possible if the
body of clause C1 can be unified with a part of the body of C2.

In our example, there is only one clause that can be absorbed by the example E,
namely T3. This clause can be absorbed by Ef because its body

child_of(Y,X) ∧ sex(X,S) ∧ female(S)

can be unified with the subpart

child_of(Z',Y') ∧ sex(Y',S') ∧ female(S')

of the body of Ef with the substitution {Y/Z', X/Y', S/S'}.

Absorption then replaces, in the body of the absorbing clause C2, the matched
instance of the body of the absorbed clause C1 by the head of C1, properly
substituted.

In our example, this gives the following new clause:

Ef: grandfather(X',Z') ←
       father(X',Y') ∧ mother(Y',Z') ∧ tom(X') ∧ liz(Z') ∧ helen(Y').
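
The absorption step on flattened clauses can be sketched in Python as follows.
This is a simplified illustration, not the IRES implementation: it assumes
function-free clauses encoded as (head, body) with atoms as tuples of strings,
and it uses one-way matching of the absorbed clause's variables onto the terms
of the absorbing clause (which suffices for this example) instead of full
unification. The names match_atom, match_body and absorb are hypothetical, and
X1, Y1, ... stand for X', Y', ...

```python
def match_atom(pattern, target, s):
    """Extend s so that pattern under s equals target; None on failure."""
    if pattern[0] != target[0] or len(pattern) != len(target):
        return None
    s = dict(s)
    for p, t in zip(pattern[1:], target[1:]):
        if p[:1].isupper():                  # variable of the absorbed clause
            if s.setdefault(p, t) != t:      # already bound to something else
                return None
        elif p != t:                         # constant mismatch
            return None
    return s

def match_body(patterns, targets, s=None, used=()):
    """Match every pattern literal to a distinct target literal (backtracking)."""
    s = {} if s is None else s
    if not patterns:
        return s, used
    for i, t in enumerate(targets):
        if i in used:
            continue
        s2 = match_atom(patterns[0], t, s)
        if s2 is not None:
            found = match_body(patterns[1:], targets, s2, used + (i,))
            if found is not None:
                return found
    return None

def absorb(c1, c2):
    """Absorb clause c1 into clause c2; return the rewritten c2, or None."""
    (h1, b1), (h2, b2) = c1, c2
    found = match_body(b1, b2)
    if found is None:
        return None
    s, used = found
    new_literal = (h1[0],) + tuple(s.get(a, a) for a in h1[1:])
    rest = [lit for i, lit in enumerate(b2) if i not in used]
    return h2, rest + [new_literal]

t3f = (("mother", "X", "Y"),
       [("child_of", "Y", "X"), ("sex", "X", "S"), ("female", "S")])
ef = (("grandfather", "X1", "Z1"),
      [("father", "X1", "Y1"), ("child_of", "Z1", "Y1"), ("sex", "Y1", "S1"),
       ("tom", "X1"), ("liz", "Z1"), ("helen", "Y1"), ("female", "S1")])
ef_absorbed = absorb(t3f, ef)
```

On these inputs the three matched literals of Ef are replaced by mother(Y1,Z1),
which corresponds to the clause given above up to the order of the body
literals.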

At this step, Absorption is not applicable anymore; the choice remains between
Intraconstruction and Truncation. Let us suppose the user chooses to apply
Intraconstruction.

2.2.2.2 Intraconstruction

Intraconstruction compares its two input clauses and rewrites them by
introducing a new intermediary predicate to get a more concise expression of the
theory. Intraconstruction can occur between two clauses C1 and C2 if there
exists a non empty common generalization of the two clauses. This generalization
must cover the heads of C1 and C2 and at least one literal in the bodies of C1
and C2. It proceeds in three steps.

Firstly, the system generates a new clause GG, the head of which is the
generalization of the heads of C1 and C2. The body of GG is the generalization
of the bodies of C1 and C2. For instance, the first step of applying
Intraconstruction to Ef and T1 gives the following clause:

GG: grandfather(U,W) ← father(U,V).

The second step takes care of the left-over literals in
C1 and C2. Here, the literals mother(Y',Z') ∧ tom(X') ∧
liz(Z') ∧ helen(Y') of E' and father(Y,Z) of T1 have been
left over during the generalization step of
Intraconstruction. Intraconstruction then introduces a new
concept, arbitrarily called newp. This new concept is
defined by two clauses whose bodies are respectively the
left-over literals in the initial clauses. The arguments
of newp have to be carefully chosen to keep the variable
bindings that were present in C1 and C2, such that no
generalization occurs during this step. In our example,
these new clauses are:

T1b: newp(Y,Z) ← father(Y,Z).

T1cc: newp(X',Y',Z') ← mother(Y',Z') ∧
    tom(X') ∧ liz(Z') ∧ helen(Y').

We notice that in T1cc we need one more argument
than in T1b to keep the bindings between the leftover
literals and the generalization. The system therefore
suggests simplifying E' before proceeding to
Intraconstruction, in order to have symmetrical bindings
for the arguments of newp.

2.2.2.3  Truncation 

Truncation generalizes a single clause by turning
terms into variables. This is done by dropping some
literals among the ones introduced by the flattening
algorithm. In our example, Truncation can at most be
used to turn constants of T1cc into variables. The previous
Intraconstruction suggested that tom(X') should be
dropped from E'. It gives the more general clause:

E'': grandfather(X',Z') ←
    father(X',Y') ∧ mother(Y',Z') ∧
    liz(Z') ∧ helen(Y').

Additional information about the domain stating that
first names are not relevant in this context, or an oracle, can
allow liz(Z') and helen(Y') to be dropped as well, but we
stick to E'' for our example.

2.2.2.4  Intraconstruction (continued)

We now go back to Intraconstruction, where the two input
clauses are now T1 and E''. The generalization of the
two clauses remains the same as in section 2.2.2.2.
The two new clauses defining newp are:

T1b: newp(Y,Z) ← father(Y,Z).

T1cc: newp(Y',Z') ← mother(Y',Z') ∧
    liz(Z') ∧ helen(Y').

The bindings for T1b and T1cc are now symmetrical
due to the use of Truncation. We can then proceed to the
last step of Intraconstruction, which replaces E'' and T1 by
the following three clauses:

T1a: grand-father(U,W) ← father(U,V) ∧
    newp(V,W).

T1b: newp(Y,Z) ← father(Y,Z).

T1cc: newp(Y',Z') ← mother(Y',Z') ∧
    liz(Z') ∧ helen(Y').

An oracle is needed to further simplify T1cc into T1c
using Truncation:

T1c: newp(Y',Z') ← mother(Y',Z').

The oracle is then asked to validate and name the
predicate newp. The proposed name is parent.

The new theory T is formed of the clauses (T1a, T1b,
T1c, T2, T3). This theory T is able to recognize the new
example, and all other examples of maternal grandfathers.

3  Saturation 

3.1  Two  problems  with  inverse  resolution 

While implementing the above-described operators in
the IRES system, we discovered some rather
fundamental limitations of inverse resolution.

3.1.1  Truncation 

The first problem when we consider non-unit clauses
arises with Truncation. Truncation, as a pure inversion of
a resolution operator, gets rid of literals introduced
during the flattening step. In a functional² notation, it
corresponds to generalizing a clause by replacing some terms
by variables. It seems appealing to extend Truncation so
that we can drop any literal in a flattened representation.
Dropping literals that were not introduced by flattening
corresponds to the dropping condition rule in a functional
representation (Michalski, 1983). One single rule in a
flattened representation (dropping any literal) is sufficient
to express the two generalization rules that are necessary
in a functional representation (that is, turning terms into
variables, and dropping literals).

² By functional notation we mean the unflattened
representation of a clause, as opposed to the flattened
representation.

The problem with this extension is that it is outside
the scope of pure inversion of resolution. From a clause
A ← B ∧ C, we can now obtain the clause A ← C with
Truncation. However, this is not the reverse of a
resolution step: using SLD resolution, it is not possible
to derive the clause A ← B ∧ C from the clause A ← C.

This is one of the reasons that led us to introduce a
different definition for generality: A is more general than
B if A ⊨ B. This definition allows us to say that A ← C is
more general than A ← B ∧ C because

(A ← C) ⊨ (A ← B ∧ C),

whereas

(A ← C) ⊬SLD (A ← B ∧ C).

This extension of our definition of generality will
also be justified in the sections to come.
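
In the propositional case this notion of generality can be checked mechanically. The sketch below is an illustration only and is not part of the IRES system; the helper names `clause_holds` and `entails` are hypothetical. It enumerates truth assignments to confirm that A ← C entails A ← B ∧ C, while the converse does not hold.

```python
from itertools import product

def clause_holds(head, body, world):
    """A definite clause head <- body is false only when the body holds and the head fails."""
    return world[head] or not all(world[atom] for atom in body)

def entails(general, specific, atoms):
    """True if every truth assignment satisfying `general` also satisfies `specific`."""
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if clause_holds(*general, world) and not clause_holds(*specific, world):
            return False
    return True

ATOMS = ["A", "B", "C"]
short_clause = ("A", ["C"])        # A <- C
long_clause = ("A", ["B", "C"])    # A <- B ^ C
```

Calling `entails(short_clause, long_clause, ATOMS)` tests the direction used in the text; swapping the arguments tests the converse.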

3.1.2  Absorption 

Saturation has its origin in a chaining problem we
had with Absorption. Absorption is destructive: it
replaces, in the body of one clause, the body of another
clause by its head. The literals that have been replaced are
lost, and this can have harmful effects on further
generalizations or Intraconstructions if the wrong choice
has been made for the clauses involved in the Absorption,
or concerning the order in which Absorptions occur³.
For the sake of clarity, let us take an example with the
two following clauses in our domain theory:

A ← B ∧ C ∧ E

B ← C ∧ D.

and then the following example comes:

F ← C ∧ D ∧ E.

Absorption of the example with the second clause
would yield the clause

F ← B ∧ E

Literals C and D are removed, thus preventing one
more absorption with the first clause of the theory.
Besides the fact that Absorption is not complete, since all
possibilities cannot be tried, this has another bad effect:
the order in which Absorption is applied is very
important, yet the right choice cannot be known in
advance. Thus a large search has to be performed.

A better solution would be to keep all the literals in
the body of the clause when performing Absorption. We
call this method Saturation. This would first yield the
clause

F ← [C ∧ D] ∧ E ∧ B

then applying Absorption again with the first clause
gives

F ← [C ∧ D ∧ E ∧ B] ∧ A

Brackets denote that the literals have already been used
for an absorption step, and are therefore optional (they
can be dropped in a later simplification step). This
notation is taken from (Loveland, 1978).

³ This point has been made by S. Muggleton in a
private communication.

Thus the Saturation operator overcomes the destructive
effects of Absorption by doing a kind of forward chaining
on the body of one clause, as the following logical
analysis will show. It also offers a good alternative to the
search needed for Absorption. The clause

F ← [C ∧ D ∧ E ∧ B] ∧ A

is a compact way of representing all the clauses that
should be tried with Absorption. These clauses are
obtained by removing some of the bracketed literals.
Some of the 16 possible clauses are given below:

F ← C ∧ D ∧ E ∧ B ∧ A
F ← D ∧ E ∧ B ∧ A
F ← E ∧ A
F ← A
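
The propositional example above can be replayed with a few lines of code. This is a minimal sketch under stated assumptions, not the implementation used in the system: clauses are propositional, and the function names `saturate` and `expansions` are hypothetical. Literals consumed by a rule are kept and marked optional, which corresponds to the bracket notation.

```python
from itertools import combinations

def saturate(body, theory):
    """Forward-chain on `body`; return (optional, required) literal sets.

    Literals consumed by a rule are kept but marked optional (bracketed).
    """
    known = set(body)
    optional = set()
    changed = True
    while changed:
        changed = False
        for head, rule_body in theory:
            if head not in known and set(rule_body) <= known:
                known.add(head)
                optional |= set(rule_body)
                changed = True
    return optional, known - optional

def expansions(optional, required):
    """Enumerate every clause body obtained by dropping some optional literals."""
    pool = sorted(optional)
    for size in range(len(pool), -1, -1):
        for kept in combinations(pool, size):
            yield set(kept) | required

# Domain theory: A <- B ^ C ^ E and B <- C ^ D; example body: C ^ D ^ E.
THEORY = [("A", ["B", "C", "E"]), ("B", ["C", "D"])]
optional, required = saturate(["C", "D", "E"], THEORY)
bodies = list(expansions(optional, required))
```

Here `optional` holds the four bracketed literals and `required` holds A, so `bodies` enumerates the 16 candidate clause bodies mentioned in the text.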

3.2  Logical  analysis 

The preceding remarks led us to introduce a noticeably
different framework for induction. The starting hypothesis
remains the same: a domain theory T (a definite clause
program) is given, as well as an example clause
(B ← B1 ∧ B2 ∧ ... ∧ Bn) not explained by the domain
theory, i.e. not entailed by T. The goal of learning is to
obtain a clause H, known as the induction hypothesis, such that

T ∧ H ⊨ (B ← B1 ∧ B2 ∧ ... ∧ Bn)

This can be transformed into

T ∧ ¬(B ← B1 ∧ B2 ∧ ... ∧ Bn) ⊨ ¬H

The next step is to remove all the variables in the
negated example clause by skolemizing them.
Skolemizing amounts to applying a substitution θ
involving constants which do not appear in the theory.

T ∧ ¬(B ← B1 ∧ B2 ∧ ... ∧ Bn)θ ⊨ ¬H

This can be transformed into

T ∧ ¬Bθ ∧ B1θ ∧ B2θ ∧ ... ∧ Bnθ ⊨ ¬H

The problem of finding H is now reduced to deduction.
Our solution is to use forward chaining deduction on
B1θ ∧ B2θ ∧ ... ∧ Bnθ (which is the body of the example
clause) using the theory T. Let us consider any
conjunction of atoms A1 ∧ A2 ∧ ... ∧ Am that can be
proved that way. We thus have

T ∧ B1θ ∧ B2θ ∧ ... ∧ Bnθ ⊨ A1 ∧ A2 ∧ ... ∧ Am

We then have, by adding ¬Bθ to both sides,

T ∧ ¬Bθ ∧ B1θ ∧ B2θ ∧ ... ∧ Bnθ ⊨ ¬Bθ ∧ A1 ∧ A2 ∧ ... ∧ Am

By factorizing negations we obtain

T ∧ ¬(B ← B1 ∧ B2 ∧ ... ∧ Bn)θ ⊨ ¬(Bθ ← A1 ∧ A2 ∧ ... ∧ Am)

The last step is to undo the skolem substitution, i.e.
replace each skolem constant introduced by θ with a new
variable. We note this operation θ⁻¹. We have proved that
we then have the logical entailment:

T ∧ ¬(B ← B1 ∧ B2 ∧ ... ∧ Bn) ⊨ ¬(Bθ ← A1 ∧ A2 ∧ ... ∧ Am)θ⁻¹

Thus H is taken as (Bθ ← A1 ∧ A2 ∧ ... ∧ Am)θ⁻¹.
We have also proved that all the hypotheses H having the
same predicate symbol in the head as in the example can
be obtained that way, which was not the case with
Absorption.

The last refinement of this method is the use of
optional literals (which are put between brackets) when
performing deduction. This refinement of resolution is
analyzed in detail in (Loveland, 1978).

Saturation has already been applied in the context of
generalization (Bisson, 1989), although the use of
saturation to get to a generalization in the two approaches
is quite different.

3.3  Example 

We present the principle of Saturation on an example
taken from (Wirth, 1988), as the application domain,
namely completion of Definite Clause Grammars, is
particularly suited for this. The domain theory expresses
natural language parsing rules. Casting Wirth's
representation into ours gives the following:

(LFP1): sentence(S0,S) ←
    noun_phrase(S0,S1) ∧
    verb_phrase(S1,S).

(LFP2): noun_phrase(S0,S) ←
    determiner(S0,S1) ∧
    noun(S1,S).

(LFP3): noun_phrase(S0,S) ← name(S0,S).

(LFP4): verb_phrase(S0,S) ←
    intransitive_verb(S0,S).

(LFP5): noun(X,S) ← consp(man,S,X).

(LFP6): name(X,S) ← consp(sue,S,X).

(LFP7): determiner(X,S) ← consp(a,S,X).

(LFP8): determiner(X,S) ← consp(the,S,X).

(LFP9): intransitive_verb(X,S) ← consp(sleeps,S,X).

(LFP10): transitive_verb(X,S) ← consp(loves,S,X).

We start with the flat representation of the input
clause sentence([sue,loves,a,man|S],S) (for the sake of
readability, we did not flatten constants in the example):

(I): sentence(S0,S) ←
    consp(sue,S1,S0) ∧ consp(loves,S2,S1) ∧
    consp(a,S3,S2) ∧ consp(man,S,S3).

This example is skolemized by replacing all variables
by new constants s0, s, s1, s2, s3:

(Is): sentence(s0,s) ←
    consp(sue,s1,s0) ∧ consp(loves,s2,s1) ∧
    consp(a,s3,s2) ∧ consp(man,s,s3).


Saturation builds in one pass the bottom-up parsing of
this sentence, rewriting parts of the clause in terms of
higher-level concepts. The result of Saturation will be:

(Iss): sentence(s0,s) ←
    [consp(sue,s1,s0) ∧ consp(loves,s2,s1) ∧
    consp(a,s3,s2) ∧ consp(man,s,s3) ∧
    name(s0,s1)] ∧ [determiner(s2,s3) ∧ noun(s3,s)]
    ∧ transitive_verb(s1,s2) ∧ noun_phrase(s0,s1)
    ∧ noun_phrase(s2,s).

The last step is to undo the skolem substitution,
which gives:

(If): sentence(S0,S) ←
    [consp(sue,S1,S0) ∧ consp(loves,S2,S1) ∧
    consp(a,S3,S2) ∧ consp(man,S,S3) ∧
    name(S0,S1)] ∧ transitive_verb(S1,S2) ∧
    [determiner(S2,S3) ∧ noun(S3,S)] ∧
    noun_phrase(S0,S1) ∧ noun_phrase(S2,S).

From now on we will not indicate the skolemizing
step since it is straightforward. The result of Saturation
can be clearly seen on the following graph, where nodes
stand for literals of the clause and arrows denote the
deductions made while saturating. Low-level as well as
high-level concepts appear in the body of the clause. There
are of course redundancies, but all the information
contained in the body of the clause is explicit.


                      sentence(S0,S)

noun_phrase(S0,S1)    transitive_verb(S1,S2)    noun_phrase(S2,S)

name(S0,S1)                                     determiner(S2,S3)  noun(S3,S)

consp(sue,S1,S0)  consp(loves,S2,S1)  consp(a,S3,S2)  consp(man,S,S3)


Figure 1: Graphical representation of a saturated clause
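
The saturation of the skolemized clause (Is) can be reproduced with a naive forward chainer. The sketch below is an illustration under stated assumptions rather than the system's actual code: atoms are tuples, variables are capitalised strings as in Prolog, every head variable also occurs in the rule body, and all function names are hypothetical. The bracketing of optional literals is left out; only the deduced atoms are computed.

```python
def is_var(term):
    """Variables are capitalised strings, as in Prolog; everything else is a constant."""
    return term[0].isupper()

def match(pattern, fact, binding):
    """Extend `binding` so that `pattern` equals the ground `fact`, or return None."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    binding = dict(binding)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if binding.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return binding

def solutions(body, facts, binding):
    """Yield every binding that makes all body literals match known facts."""
    if not body:
        yield binding
        return
    for fact in facts:
        extended = match(body[0], fact, binding)
        if extended is not None:
            yield from solutions(body[1:], facts, extended)

def saturate(body_facts, theory):
    """Naive forward chaining: return the atoms deducible from the clause body."""
    known = set(body_facts)
    while True:
        new = set()
        for head, body in theory:
            for binding in solutions(body, known, {}):
                args = tuple(binding[a] if is_var(a) else a for a in head[1:])
                derived = (head[0],) + args
                if derived not in known:
                    new.add(derived)
        if not new:
            return known - set(body_facts)
        known |= new

# Rules (LFP1) to (LFP10), written as (head, body) pairs.
THEORY = [
    (("sentence", "S0", "S"), [("noun_phrase", "S0", "S1"), ("verb_phrase", "S1", "S")]),
    (("noun_phrase", "S0", "S"), [("determiner", "S0", "S1"), ("noun", "S1", "S")]),
    (("noun_phrase", "S0", "S"), [("name", "S0", "S")]),
    (("verb_phrase", "S0", "S"), [("intransitive_verb", "S0", "S")]),
    (("noun", "X", "S"), [("consp", "man", "S", "X")]),
    (("name", "X", "S"), [("consp", "sue", "S", "X")]),
    (("determiner", "X", "S"), [("consp", "a", "S", "X")]),
    (("determiner", "X", "S"), [("consp", "the", "S", "X")]),
    (("intransitive_verb", "X", "S"), [("consp", "sleeps", "S", "X")]),
    (("transitive_verb", "X", "S"), [("consp", "loves", "S", "X")]),
]
# Body of the skolemized example (Is).
BODY = [("consp", "sue", "s1", "s0"), ("consp", "loves", "s2", "s1"),
        ("consp", "a", "s3", "s2"), ("consp", "man", "s", "s3")]
derived = saturate(BODY, THEORY)
```

With this theory, `derived` contains the six higher-level literals of the saturated clause; no verb_phrase atom is produced, which is why the theory cannot explain the sentence.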

3.4  Boundaries  of  generality 

The above graph can be interpreted with a generality
relation: the higher-level literals are more general than
the others. This seems to contradict the definition of
generality we have used before, since the higher literals are
deduced from the lower ones.

In fact, what is compared is not literals but clauses.
The above graph potentially represents a set of
different clauses, and what should be compared for
generality are these clauses, not simple literals. To obtain
one of these clauses, one should draw a line across the
graph, called boundary of generality, and keep all
the literals just above this line. For instance, if we use the
boundary of generality (B1) in figure 2, we obtain the
clause:

sentence(S0,S) ←
    noun_phrase(S0,S1) ∧ transitive_verb(S1,S2) ∧
    noun_phrase(S2,S).

If we take the boundary of generality (B2) we obtain:

sentence(S0,S) ←
    consp(sue,S1,S0) ∧ transitive_verb(S1,S2) ∧
    noun_phrase(S2,S).

The first clause is more general than the second one.
This is why we can say that noun_phrase(S0,S1) is more
general than consp(sue,S1,S0) with respect to the above
graph. Thus the graph represents the generality level of
literals: the higher a literal is, the more general will be a
clause with this literal in the body. Note that this notion
of boundary of generality is similar to that of
boundary of operationality from (Braverman & Russell,
1988). This will be used in the Intraconstruction and
Truncation operators as described in section 4.2.


[Graph of Figure 1, crossed by the boundaries of generality (B1) and (B2)]

Figure 2: Saturated clause with boundaries
of generality

Depending on the learning goal, it is now possible to
favour higher-level terms, and to prune some parts of this
graph. The control of the operators is thus easier than
with Absorption: all the relevant information is explicit
when choices have to be made. Various biases can be used
at this point. If we retain only the connected top (see
section 4 below) we obtain the clause:

(Ir): sentence(S0,S) ←
    noun_phrase(S0,S1) ∧ transitive_verb(S1,S2) ∧
    noun_phrase(S2,S).

which was the missing clause in the theory. This
clause can now undergo Intraconstruction with (LFP1) to
complete the definition of verb_phrase.


4  Intraconstruction and Truncation revisited

4.1  General  algorithm  of  N-IRES 

The previous remarks imply a total rethinking of our
previous approach to inversion of resolution.
Intraconstruction and Truncation will from now on
operate on the graphs produced by Saturation. This will
allow us to define explicit control by taking into account
the level of generality in the graphs. The generalization step
of Intraconstruction becomes a matching between the two
input graphs, where nodes are literals partially ordered
by subsumption. As the literals do not have the same
generality level, Truncation operates with higher priority
on low-level literals, which are already subsumed by
some other terms in the generalization. Truncation
becomes a central operator that prevents the clauses
resulting from Saturation from growing immoderately in size.
Intraconstruction also benefits from the graph structure.
Of course, partial matching of two graphs is
NP-complete, but matching can be seen as finding some
paths in the graph starting from the heads of the clauses
(we are then brought back to a tree-matching problem).
We can also use the fact that the literals are ordered in
the graph to constrain the order in which the literals are
going to be generalized. A possible bias for the
generalization step is proposed in the next section.

The global algorithm of N-IRES is now the
following. We suppose that examples come one by one,
and are handled by the system as soon as they arrive.

For each new example, a subsumption test is
performed (like the one from (Buntine, 1988)). If the
example is subsumed by the domain theory, nothing is
done. If the example is not explained by the theory, the
input is saturated using the original domain theory, then
it may undergo Intraconstruction or Truncation. The
results of these operators are submitted to the oracle, which
validates them or not. Once a result has been validated, it
is added to the domain theory. The system then checks
whether any clause of the original theory can be
simplified (by a sequence of Saturation - Intraconstruction
- Truncation) using this new clause. If this is the case, the
expression of the theory is simplified. This is similar to
the rule-reexamination from (Hall, 1988) and allows the
system to be less sensitive to the order of the examples.
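
The control flow for one incoming example can be outlined as follows. This is only a schematic sketch: the function `process_example` and its callable parameters (`explained`, `saturate`, `propose`, `oracle`) are hypothetical placeholders for the subsumption test, Saturation, the Intraconstruction and Truncation operators, and the oracle. The final re-examination of the original theory is not modelled.

```python
def process_example(example, theory, explained, saturate, propose, oracle):
    """Handle one incoming example and return the (possibly extended) theory."""
    if explained(example, theory):
        return theory                      # subsumed by the theory: nothing to do
    saturated = saturate(example, theory)  # forward chaining on the example body
    for candidate in propose(saturated, theory):
        # Candidates stand for results of Intraconstruction or Truncation.
        if oracle(candidate):
            return theory + [candidate]    # validated result joins the theory
    return theory
```

Passing the operators as arguments keeps the loop independent of how each one is implemented.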

The following small example hints at the
improvements to Intraconstruction made possible by
Saturation.

4.2  Example 

We  concentrate  here  on  learning  the  rule  for  the 
general  case  of  arithmetic  addition.  Integers  are 
represented  using  the  constant  0  and  the  successor 
function  s:  s(X)  represents  the  integer  X+1. 




The first input is

(I1): 1 + 1 = 2, which is represented by
plus(s(0),s(0),s(s(0))).

Since two identical constants are replaced by identical
terms, this will give the following flattened
representation:

(I1f): plus(X,X,Y) ←
    zero(C) ∧ succ(C,X) ∧ succ(X,Y).

The next input is (I2): 1 + 2 = 3. Its flattened
representation is

(I2f): plus(X,Y,Z) ←
    zero(C) ∧ succ(C,X) ∧ succ(X,Y) ∧
    succ(Y,Z).

Saturation is fired, giving the clause:

(I2fs): plus(X,Y,Z) ←
    zero(C) ∧ succ(C,X) ∧ succ(X,Y) ∧
    succ(Y,Z) ∧ plus(X,X,Y).

This can be described graphically as follows, using the
same conventions as in figure 1:

plus(X,Y,Z)


zero(C)  succ(C,X)  succ(X,Y)  succ(Y,Z)

Figure 3: Saturated graph of '1+2 = 3'

The third example is (I3): 1 + 3 = 4. The
corresponding flattened clause is:

(I3f): plus(X',Z',U') ←
    zero(C') ∧ succ(C',X') ∧ succ(X',Y') ∧
    succ(Y',Z') ∧ succ(Z',U').

Saturation provides (firing I1f and I2fs on the body of
I3f) the clause:

(I3fs): plus(X',Z',U') ←
    zero(C') ∧ succ(C',X') ∧ succ(X',Y')
    ∧ succ(Y',Z') ∧ succ(Z',U') ∧
    plus(X',X',Y') ∧ plus(X',Y',Z').


plus(X',Z',U')


Figure 4: Saturated graph of '1+3 = 4'

We then compare the two graphs that represent the
saturated clauses. They are compared because their two
heads match (which was not the case for I1f and I2f), in
order to find some regularities in the expression of the
two additions. The first step is to propose the
generalization produced by Truncation to the user. If the
user does not validate the generalization on its own,
N-IRES then proceeds to Intraconstruction (i.e., introduces a
new predicate to cover the leftover literals), where no
induction is done.

Generalizing means finding a partial match of the two
graphs. Finding the maximal partial match is NP-hard,
but on this example we are going to show that we are
not necessarily interested in the maximal partial match.
The heads of the clauses are first generalized, and then the
match is extended by exploring first the nodes which
are near the root node (the most general generalizations are
favoured).

plus(X',Z',U')


Figure 5: Best graph matching for learning addition

The generalization is:

(G): plus(A,E,C) ←
    plus(A,D,B) ∧ succ(B,C) ∧ succ(D,E)
    ∧ zero(C') ∧ succ(C',A).

This generalization completely covers the I2fs graph,
and leaves two nodes of the I3fs graph unmatched. It is
close to the expression we wish to learn, but it still
contains irrelevant terms.

4.3  Connectivity 

This generalization can be further simplified: literals
that are already covered by some terms in the
generalization (such as zero(C') and succ(C',A)) can be
dropped from the generalization, as long as it remains
connected (roughly, there must exist one path linking
each variable in the body of the generalization to one
variable that appears in the head predicate, through
predicates in the clause). This can be called the
connectivity heuristic, and it leaves us with the minimal
links (one path) between variables in the body of the
generalization and variables in its head.
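
The connectivity condition itself is easy to test. The following sketch checks only whether a clause is connected in the sense described above; it does not decide which literals to drop. The function name `is_connected` is hypothetical, and the literals are those of the generalization (G) as read from the text, with C' denoting the variable of the zero literal.

```python
def is_connected(head_vars, body):
    """True if every body variable is linked to a head variable through shared literals."""
    reached = set(head_vars)
    changed = True
    while changed:
        changed = False
        for _, args in body:
            # A literal touching a reached variable makes all its variables reachable.
            if reached & set(args) and not set(args) <= reached:
                reached |= set(args)
                changed = True
    return all(set(args) <= reached for _, args in body)

HEAD_VARS = ["A", "E", "C"]                      # head: plus(A,E,C)
CORE = [("plus", ["A", "D", "B"]), ("succ", ["B", "C"]), ("succ", ["D", "E"])]
FULL = CORE + [("zero", ["C'"]), ("succ", ["C'", "A"])]
```

Both `CORE` and `FULL` are connected, whereas keeping zero(C') without succ(C',A) would leave C' with no path to the head.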

We are looking for a relaxation of this fairly strong
constraint. A similar kind of constraint has been
considered in (Helft, 1988). Using simplifying heuristics
should lead us to drop the two literals zero(C') and
succ(C',A). We thus obtain:

plus(A,E,C) ←
    plus(A,D,B) ∧ succ(B,C) ∧ succ(D,E)

which gives, once unflattened:

plus(A,succ(D),succ(B)) ← plus(A,D,B).

This means that A + succ(D) = succ(A + D). We recognize
here a definition of addition in the general case.

5  Conclusion 

After implementing, in the IRES system, a simple
solution to inversion of resolution when the representation
language used does not include function symbols, we noticed
some more fundamental limitations to the inversion of
resolution approach. The major limit affects the
Absorption operator. To overcome this limit, we relate
our work to results in Logic Programming and replace
the Absorption operator with the Saturation operator, which
makes all the possible deductions on the body of one
clause. To handle saturated clauses efficiently, we
represent them in the form of a graph ordered by the
subsumption relation. Intraconstruction and Truncation
then operate on these new structures, which reduce the
search for applicable operators. In the same way, the graph
structure makes it easy to express biases that constrain the
search for a good generalization while applying
Saturation and Intraconstruction (details can be found in
(Rouveirol, 1990)).

We see two major research directions at this point.
First of all, we are going to study ways to express biases to
simplify the graphs after Saturation (that is, how to
choose among all the potential generalizations contained
in a saturated clause and how to prune the irrelevant
terms) and to constrain the graph matching step in
Intraconstruction. A second important issue will be to
evaluate the system, and in particular to scale up to large
applications.
Abstract

This paper introduces a new programming
methodology, called Genetic Programming,
which is the application of the Genetic
Algorithm to the evolution of the signs and
weights of fully (self) connected neural network
modules which perform some time (in)dependent
function (e.g. walking, oscillating etc.) in an
“optimal” manner. Genetically Programmed
Neural Net (GenNet) modules are of two types,
functional and control. A series of functional
GenNets can be evolved, and their weights frozen.
Control GenNets are then evolved whose outputs
are the inputs of the functional GenNets. The size
and timing of these control signals are evolved
such that the combination of control and
functional GenNets performs as desired. This
combination can then be frozen and used as a
module in a more complex structure. This
procedure can be repeated indefinitely, thus
allowing the construction of hierarchical neural
networks. Genetic Programming has recently
proven to be so successful that the building of
artificial nervous systems becomes a real
possibility.

This paper illustrates both the conceptual
simplicity and the power of Genetic
Programming by showing how a GenNet can be
evolved which teaches a pair of stick legs to
walk. This is followed by a description of work
in progress on the next major phase of Genetic
Programming research, namely the building of
artificial nervous systems (“brain building”), and
on the tools which will be needed to evolve
them, called Darwin Machines.


Introduction 

This paper shows that fully (self) connected neural
networks [RUMELHART et al 1986], whose weights and
signs are evolved with the Genetic Algorithm
[GOLDBERG 1989], are powerful enough to control a
system which is highly time dependent in its behaviour,
e.g. a pair of walking stick legs. Normally, recurrent
neural networks (i.e. those with feedback links between
neurons) require a certain “settling time” for the output
neurons to stabilize their output signal values given fixed
(clamped) input values to the input neurons.

What is interesting in the experiments in this paper is
that the inputs change faster than the settling time, so that
the neural network never settles. Under such
circumstances, it is not obvious how the usual neural
network learning algorithms can be applied to teach such a
network to walk. The Genetic Algorithm (GA), however,
does not really care how the neural network performs this
task, so long as it performs it.

Section  1  of  this  paper  summarises  briefly  the 
principles  of  the  Genetic  Algorithm  (GA).  Section  2 
introduces  the  concept  of  Genetic  Programming.  Section  3 
describes  the  particular  GenNet  used  and  how  the  GA  is 
applied  to  its  evolution.  Section  4  presents  the  results  of 
some  GenNet  experiments.  Section  5  discusses  ideas  for 
future  research  and  in  particular,  the  concepts  of  Brain 
Building  and  the  Darwin  Machine. 

1  The  Genetic  Algorithm 

The Genetic Algorithm is a form of simulated
evolution that solves optimization problems with a Darwinian
"survival of the fittest" approach. Solutions to problems
(e.g. parameter values in a control problem) are coded
onto (usually binary) strings called "chromosomes",
which compete with each other to reproduce into the
next generation. A quality value for the encoded solution
of each chromosome is determined, and the probability of
reproduction of each chromosome into the next generation
is proportional to this value. The number of chromosomes
per generation remains fixed. Genetic operators such as
mutation (bit flipping), crossover (cutting two
chromosomes at the same position and swapping
portions) and inversion (inverting a section of a
chromosome) are applied to the offspring. Occasionally,
the application of these operators causes the offspring to
have higher quality values than their parents. Hence they will
reproduce with higher probability, and squeeze out inferior
chromosomes. Over time, the average quality of the
population will increase. The GA can be seen as a form of
hill climbing where there may be many hills in the
configuration space. For an excellent introduction to the
principles of Genetic Algorithms see [GOLDBERG 1989].
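
The loop described above can be written down compactly. The sketch below is a generic illustration and not the program used for the GenNet experiments: the function name `evolve`, its parameter values and the "one-max" quality function are assumptions made for the example. Each crossover here yields a single child, inversion is omitted, and the quality function must return non-negative values.

```python
import random

def evolve(fitness, length=16, pop_size=20, generations=30, p_mut=0.01, seed=0):
    """Evolve a fixed-size population of bit strings; return the final population."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # A small epsilon keeps selection well defined when every quality is zero.
        weights = [fitness(c) + 1e-9 for c in population]
        offspring = []
        while len(offspring) < pop_size:
            # Reproduction probability is proportional to the quality value.
            mother, father = rng.choices(population, weights=weights, k=2)
            cut = rng.randrange(1, length)                 # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with probability p_mut.
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            offspring.append(child)
        population = offspring
    return population

final = evolve(fitness=sum)   # "one-max": quality is the number of 1 bits
```

The population size stays fixed from one generation to the next, as in the description above.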

2  Genetic  Programming 

Genetic  Programming  is  a  new  programming 
methodology  which  uses  the  Genetic  Algorithm  to  design 
neural  network  modules  [de  GARIS  1990].  The 
programmer  specifies  the  behavioural  characteristics  that 
the  neural  network  should  possess,  the  number  of  neurons 
in the network (which will be fully (self) connected),
identifies  the  input  and  the  output  neurons,  the  quality 
criterion  for  the  performance  of  the  network,  the  number 
of  binary  places  after  the  binary  point  of  the  numbers 
which  represent  the  size  of  the  weights  connecting 
neurons,  the  initial  input  signal  values,  the  number  of 
cycles,  etc.  The  user  also  provides  a  list  of  parameters 
necessary  to  control  the  GA,  such  as  the  number  of 
generations,  the  size  of  the  population,  the  probability  of 
mutation,  the  size  of  the  scaling  factor  to  avoid  premature 
convergence,  etc.  A  concrete  example  is  shown  in  the  next 
two  sections. 

The GA is then used to find both the signs and the
values of the weights of the network which provides the
functionality desired. Once these weights are found, they
are frozen (fixed) and the GenNet module thus evolved can
be used as a component in a more complex structure.
Usually a set of GenNet modules consists of a subset of
low level modules which execute some simple functions.
These low level modules are usually managed by control
modules which are themselves GenNets, i.e. they too are
evolved with the GA. The outputs of the control modules
are the inputs to the low level modules (without any
intervening weights). Once the control and low level
modules (considered now as a unit) are functioning
together as desired, the weights of the control modules are
frozen and the unit can then be considered as a module or
component for an even more complex structure.


If  one  can  analyze  a  complex  behaviour  (of  a  system) 
into  its  component  behaviours  and  their  corresponding 
control  inputs,  then  Genetic  Programming  is  a  useful 
approach. One can solve problems in this way which are
too complex to solve with traditional algorithmic
approaches.

One of the aims of this paper, as mentioned in the
abstract,  is  to  show  that  GenNets  are  powerful  enough  to 
provide  highly  time  dependent  control.  The  vehicle  used  to 
show  this  possibility  is  a  simple  pair  of  stick  legs,  which 
is  to  be  taught  to  walk.  FIG.  1  shows  the  basic  setup. 
The  output  values  of  the  GenNet  are  interpreted  to  be  the 
angular  accelerations  of  the  four  components  of  the  legs. 


[FIG. 1: the stick legs, showing the hip joint, the angular velocity, and the angular accelerations ANG ACCEL 1 to ANG ACCEL 4 of the four leg components.]

Knowing the values of the angular accelerations
(assumed  constant  over  one  cycle  -  where  a  cycle  is  the 
time  period  over  which  the  neurons  calculate 
(synchronously)  their  outputs  from  their  inputs),  and 
knowing  the  values  of  the  angles  and  the  angular 
velocities  at  the  beginning  of  the  cycle,  one  can  calculate 
the  values  of  the  angles  and  the  angular  velocities  at  the 
end of the cycle.

As input to the GenNet (control module) were chosen
the  angles  and  the  angular  velocities.  FIG.2  shows  how 
this  feedback  works.  Knowing  the  angles,  one  can  readily 
calculate  the  positions  of  the  two  “feet”  of  the  stick  legs. 
Whenever  one  of  the  feet  becomes  lower  than  the  other, 
that  foot  is  said  to  be  “on  the  ground”,  and  the  distance 
(whether  positive  or  negative)  between  the  positions  of  the 
newly  grounded  foot  and  the  previously  grounded  foot  is 
calculated.
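
The per-cycle update of the angles and angular velocities described above can be sketched as follows. This is a minimal sketch, assuming the standard constant-acceleration formulas apply over one cycle; the function name `step_kinematics` and the list layout are illustrative and not taken from the paper. Whether the original program wrapped angles back into the range -1 to +1 is not stated, so no wrapping is done here.

```python
def step_kinematics(angles, velocities, accelerations, cycletime):
    """Advance angles and angular velocities by one cycle.

    Assumes each angular acceleration is constant over the cycle, so
      new_velocity = v + a * dt
      new_angle    = theta + v * dt + 0.5 * a * dt**2
    Units follow the text: angles in half turns, velocities in half
    turns per second.
    """
    dt = cycletime
    new_angles = [
        theta + v * dt + 0.5 * a * dt * dt
        for theta, v, a in zip(angles, velocities, accelerations)
    ]
    new_velocities = [v + a * dt for v, a in zip(velocities, accelerations)]
    return new_angles, new_velocities
```

The new angles and velocities would then be fed back as the GenNet inputs for the next cycle.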

The  aim  of  the  exercise  is  to  evolve  GenNets  which 
make  the  stick  legs  move  as  far  as  possible  to  the  right  in 
the  user  specified  number  of  cycles,  and  cycle  time.  The 
GenNet  used  here  consists  of  12  neurons;  8  input  neurons 
and  4  output  neurons  (no  hidden  neurons).  The  8  input 
neurons have as inputs the values of the 4 angles and the 4
angular velocities. The input angles range from -1 to +1,
where +1 means one half turn (i.e. 180 degrees). The
initial (start) angles are chosen randomly, ranging between
0 and 1. The initial angular velocities are chosen randomly
between -1 and +1, and are given in half turns per second.

The  activity  of  a  neuron  is  calculated  in  the  usual  way, 
namely  the  sum  of  the  products  of  its  inputs  and  its 
weights,  where  weight  values  range  from  -1  to  +1.  The 
output  of  a  neuron  is  calculated  from  this  sum,  using  the 
symmetrical sigmoid function

(-1 + (2/(1 + exp(-sum)))), which ranges from -1 to +1.
The outputs of neurons are restricted to have absolute
values of less than 1 so as to avoid the risk of explosive
positive feedback.
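
The neuron update just described can be sketched as below. This is only an illustration: the names `symmetric_sigmoid` and `update_outputs`, and the convention that `weights[i][j]` is the weight on the connection from neuron j to neuron i, are assumptions. How external inputs are combined with the recurrent signals is not spelled out in the text, so the sketch simply takes one vector of incoming signals.

```python
import math


def symmetric_sigmoid(total):
    """Symmetric sigmoid from the text: -1 + 2 / (1 + exp(-total))."""
    return -1.0 + 2.0 / (1.0 + math.exp(-total))


def update_outputs(weights, signals):
    """One synchronous cycle of a fully connected network.

    weights[i][j] is the weight from neuron j to neuron i (assumed
    convention); signals[j] is the current signal of neuron j.
    """
    outputs = []
    for row in weights:
        # activity = sum of products of inputs and weights
        total = sum(w * s for w, s in zip(row, signals))
        outputs.append(symmetric_sigmoid(total))
    return outputs
```

Because the sigmoid only approaches -1 and +1 asymptotically, every output has absolute value less than 1 for finite sums.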

The  chromosomes  used  to  evolve  the  weights  and  their 
signs  in  the  GA  are  simple  binary  strings.  The  user 
specifies  the  number  P  of  binary  places  after  the  binary 
point  of  the  numbers  representing  the  values  of  the 
weights  (where  weights  have  an  absolute  value  less  than 
1).  Imagine  this  is  6.  One  bit  is  used  per  weight  to  specify 
the sign of the weight (0 is positive, i.e. an excitatory
“synapse”, 1 is negative, i.e. an inhibitory “synapse”).
Thus, for a GenNet of N neurons, a chromosome will be
N*N*(P + 1) bits long, since there are N*N weights in a
fully (self) connected network.

The  initial  population  of  chromosomes  is  generated 
randomly  with  each  bit  equally  likely  to  be  a  0  or  a  1.  The 
ith  group  of  (P  +  1)  bits  corresponds  to  the  sign  bit 
followed  by  the  P  bits  giving  the  weight  value  of  the  ith 
weight,  e.g.  100101  represents  a  weight  of  -0.15625. 
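
The decoding just described can be sketched as follows. The function names and the row-major ordering of weights in the matrix are assumptions made for illustration. Note that the example group 100101 has six bits, i.e. a sign bit followed by P = 5 fraction bits.

```python
def decode_weight(bits, places):
    """Decode one (places + 1)-bit group: sign bit, then binary fraction bits."""
    if len(bits) != places + 1:
        raise ValueError("expected a sign bit followed by `places` bits")
    magnitude = sum(int(b) * 2.0 ** -(k + 1) for k, b in enumerate(bits[1:]))
    return -magnitude if bits[0] == "1" else magnitude


def decode_chromosome(chromosome, n_neurons, places):
    """Split a bit string into an n_neurons x n_neurons weight matrix.

    The i-th group of (places + 1) bits gives the i-th weight; the
    row-major layout of the matrix is an assumption.
    """
    width = places + 1
    if len(chromosome) != n_neurons * n_neurons * width:
        raise ValueError("chromosome length must be N*N*(P+1)")
    flat = [
        decode_weight(chromosome[i:i + width], places)
        for i in range(0, len(chromosome), width)
    ]
    return [flat[r * n_neurons:(r + 1) * n_neurons] for r in range(n_neurons)]
```

For instance, `decode_weight("100101", 5)` gives the weight quoted in the text.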

For  our  experiments,  no  crossover  was  used.  GenNet 
neurons  are  so  highly  interdependent  that  crossing  over 
chromosome  portions  is  detrimental.  GenNet  GAs 
typically  function  with  no  sex  (i.e.  no  crossover)  and  no 
inversion,  using  only  mutation  and  selection.  The 
selection  technique  used  was  standard  “roulette  wheel”,  (see 
[GOLDBERG  1989]). 
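
A GA generation using only mutation and roulette wheel selection might look like the sketch below. It is a minimal illustration under stated assumptions: chromosomes are strings of "0" and "1", scores are non negative, and the helper names and the fallback for an all-zero generation are hypothetical rather than taken from the paper.

```python
import random


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chromosome
    )


def roulette_select(population, scores, rng):
    """Pick one individual with probability proportional to its score."""
    total = sum(scores)
    if total <= 0:
        # assumed fallback: choose uniformly when every score is zero
        return rng.choice(population)
    spin = rng.uniform(0.0, total)
    running = 0.0
    for individual, score in zip(population, scores):
        running += score
        if score > 0 and spin <= running:
            return individual
    return population[-1]  # guard against floating point round-off


def next_generation(population, scores, rate, rng):
    """Select parents by roulette wheel, then mutate each copy (no crossover)."""
    return [
        mutate(roulette_select(population, scores, rng), rate, rng)
        for _ in population
    ]
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable.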

The  quality  criterion  used  for  selecting  the  next 
generation  was  usually  the  total  distance  covered  by  the 
stick legs in the total time T, where (T = C*cycletime) for
the user specified number C of cycles and cycletime. The
quality is thus the velocity of the stick legs moving to the
right. Right distances are non negative. Stick legs which
moved to the left scored zero and were eliminated after the
first generation.

4  Results 

A  series  of  experiments  was  undertaken.  In  the  first 
experiment,  no  constraint  was  imposed  on  the  motion  of 
the stick legs (except for a selection to move right across
the  screen).  The  resulting  motion  was  most  un-lifelike.  It 
consisted  of  a  curious  mixture  of  windmilling  of  the  legs 
and  strange  contortions  of  the  hip  and  knee  joints. 
However  it  certainly  moved  well  to  the  right,  starting  at 
random  angles  and  angular  velocities.  As  the  distance 
covered  increased,  the  speed  of  the  motion  increased  as 
well  and  became  more  "efficient",  e.g.  windmilling  was 
squashed  to  a  "swimmers  stroke". 

In  the  second  experiment,  the  stick  legs  had  to  move 
such  that  the  hip  joint  remained  above  the  floor  (a  line 
drawn  on  the  screen).  During  the  evolution,  if  the  hip 
joint  did  hit  the  floor,  evolution  ceased,  the  total  distance 
covered  was  frozen,  and  no  further  cycles  were  executed. 
After  every  cycle,  the  coordinates  of  the  two  feet  were 
calculated and a check made to see if the hip joint did not
lie  below  both  feet.  This  time  the  evolution  was  slower, 
presumably  because  it  was  harder  to  find  new  weights 
which  led  to  a  motion  satisfying  the  constraints.  The 
resulting motion was almost as un-lifelike as in the first
experiment, again with windmilling and contortions, but
at  least  the  hip  joint  remained  above  the  floor. 

In  the  third  experiment,  a  full  set  of  constraints  was 
imposed  to  ensure  a  lifelike  walking  motion.  The  result 
was  that  the  stick  legs  moved  so  as  to  take  as  long  a 
single step as possible, and "did the splits" with the two
legs as extended as possible and the hip joint just above
the floor. From this position, it was impossible to move
any further. Evolution ceased. This was a valuable lesson
and focussed attention upon the concept of
"evolvability", i.e. the capacity for further evolution. A
change in approach was needed.

This led to the concept of Behavioural Memory, i.e.
the tendency of a behaviour which was evolved in an
earlier phase to persist in a later phase. For example, in
phase 1, evolve a GenNet for behaviour A. Take the
weights and signs resulting from this evolution as initial
values for a second phase which evolves behaviour B. One
notices that behaviour B contains a vestige of behaviour
A. This form of Sequential Evolution can be very useful
in Genetic Programming, and was used to teach the stick
legs to walk, i.e. to move to the right in a step like
manner, in 3 evolutionary phases.
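
The idea of Sequential Evolution can be sketched in a few lines. This is a toy illustration, not the author's program: `evolve` is a compact mutation-and-selection loop that uses `random.Random.choices` as a stand-in for roulette wheel selection, and `behaviour_a` and `behaviour_b` are hypothetical fitness functions on bit strings. The point shown is only that each phase starts from the population left by the previous one.

```python
import random


def evolve(population, fitness, generations, rate, rng):
    """Run a mutation-and-selection GA (no crossover) and return the result."""
    for _ in range(generations):
        scores = [fitness(c) for c in population]
        if sum(scores) > 0:
            # fitness-proportional choice of parents
            parents = rng.choices(population, weights=scores, k=len(population))
        else:
            parents = list(population)
        population = [
            "".join("10"[int(b)] if rng.random() < rate else b for b in p)
            for p in parents
        ]
    return population


def sequential_evolution(initial_population, phases, rate, rng):
    """Evolve through several phases; each phase is seeded with the
    population produced by the previous phase."""
    population = list(initial_population)
    for fitness, generations in phases:
        population = evolve(population, fitness, generations, rate, rng)
    return population


# Toy stand-ins for "behaviour A" and "behaviour B" (hypothetical).
def behaviour_a(chromosome):
    return chromosome[:4].count("1")


def behaviour_b(chromosome):
    return chromosome.count("1")
```

A three phase run, as used for the stick legs, would simply pass three `(fitness, generations)` pairs in `phases`.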

In  the  first  phase,  a  GenNet  was  evolved  over  a  short 
time period (e.g. 100 cycles) which took the stick legs
from  an  initial  configuration  of  left  leg  in  front  of  right 
leg  to  the  reverse  configuration.  The  angular  velocities 
were  all  zero  at  the  start  and  at  the  finish.  By  then 
allowing  the  resulting  GenNet  to  run  for  a  longer  time 
period (e.g. 200 cycles) the resulting motion was “step-like”,
i.e. one foot moved in front of and behind the foot
on the floor, but did not touch the floor. The quality
measure  for  this  GenNet  was  the  inverse  of  the  sum  of  the 
squares  of  the  differences  between  the  desired  and  the  actual 
output  vector  components  (treating  the  final  values  as  an  8 
component  state  vector  of  angles  and  angular  velocities). 
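
The first phase quality measure can be written down directly. The sketch below is illustrative: the function name is hypothetical, and the small `epsilon` added to avoid division by zero on a perfect match is an assumption not mentioned in the text.

```python
def inverse_squared_error(desired, actual, epsilon=1e-12):
    """Quality = 1 / (sum of squared differences between desired and actual).

    `desired` and `actual` are the 8 component state vectors of final
    angles and angular velocities. `epsilon` (assumed) keeps the result
    finite when the two vectors coincide.
    """
    error = sum((d - a) ** 2 for d, a in zip(desired, actual))
    return 1.0 / (error + epsilon)
```

A GenNet whose final state is closer to the desired state receives a larger quality score.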

This  GenNet  was  then  taken  as  input  to  a  second 
(sequential evolutionary) phase, which was to get the stick
legs to take short steps. The quality measure this time was
the  product  of  the  number  of  net  positive  steps  taken  to 
the  right,  and  the  distance.  The  result  was  a  definite 
stepping  motion  to  the  right  but  distance  covered  was  very 
small.  The  resulting  GenNet  was  used  in  a  third  phase  in 
which  the  quality  measure  was  simply  the  distance 
covered.  This  time  the  motion  was  not  only  a  very  definite 
stepping  motion,  but  the  strides  taken  were  long.  The 
stick  legs  were  walking. 

The  above  experiments  were  all  performed  in  tens  to  a 
few  hundred  generations  of  the  GA.  Typical  GA  parameter 
values  were  -  population  size  =  50;  mutation  rate  =  0.001; 
scaling  factor  =  2.0  (a  linear  scaling  of  quality  scores  in 
the GA (max. value = scaling factor * average value), which prevents
premature convergence); cycletime = 0.03; number of
cycles  =  200  to  400. 
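
One common way to realise such a linear scaling is the scheme described in [GOLDBERG 1989]: scores are mapped by a straight line so that the average is preserved and the maximum becomes the scaling factor times the average. The sketch below follows that scheme; whether the experiments used exactly this variant, including the fallback that pins the minimum at zero, is an assumption.

```python
def linear_scale(scores, factor=2.0):
    """Linearly rescale scores so that max = factor * average.

    The average is left unchanged. If that line would make the lowest
    score negative, the line is instead chosen to keep the average and
    map the lowest score to zero. `factor` must be greater than 1.
    """
    avg = sum(scores) / len(scores)
    best, worst = max(scores), min(scores)
    if best == avg:
        return list(scores)  # all scores equal: nothing to scale
    if worst > (factor * avg - best) / (factor - 1.0):
        # normal case: average preserved, maximum mapped to factor * avg
        slope = (factor - 1.0) * avg / (best - avg)
        offset = avg * (best - factor * avg) / (best - avg)
    else:
        # scaled scores would go negative: pin the minimum at zero
        slope = avg / (avg - worst)
        offset = -worst * avg / (avg - worst)
    return [slope * s + offset for s in scores]
```

With a factor of 2.0, the best individual expects about two offspring per generation, which limits how quickly one early high scorer can take over the population.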

A  video  of  the  results  of  the  above  experiments  has 
been  made.  To  obtain  a  real  feel  for  the  evolution  of  the 
motion of the stick legs, one really needs to see it.

5 Future Research - Brain Building and
the  Darwin  Machine 


The  great  advantage  of  Genetic  Programming  in 
comparison  to  traditional  neural  network  learning  schemes 
(e.g.  the  popular  backpropagation  or  backprop  method 
[RUMELHART  et  al  1986])  is  that  it  can  be 
unsupervised,  i.e.  one  need  not  know  the  desired  output 
vectors  of  the  neural  network.  If  one  is  dealing  with  a  time 
dependent  phenomenon  in  a  supervised  method,  one  needs 
to know the desired output vectors as a function of time,
which  implies  that  one  already  understands  the 
phenomenon  well  enough  to  be  able  to  know  what  the 
outputs  should  be.  With  Genetic  Programming,  the  only 
thing  that  is  necessary  is  that  one  has  a  quality  measure 
(usually  a  scalar)  of  the  performance  as  a  whole,  which  can 
be  fed  back  to  the  device  being  Genetically  Programmed. 

The  success  of  these  experiments  raises  the 
fascinating  prospect  of  building  artificial  nervous  systems 
("brain  building"),  i.e.  combining  functional  and  control 
GenNets  to  build  simple  brains. 

Obviously, if one can evolve one GenNet, one can
evolve many, each having its own characteristic properties
and  behaviour.  By  combining  the  actions  of  many 
GenNets,  it  seems  likely  that  it  will  be  possible  to  build 
artificial  nervous  systems  or  "nanobrains".  The  term 
"nanobrain"  is  appropriate,  considering  that  the  human 
brain  contains  some  trillion  neurons,  and  that  a  nanobrain 
would  therefore  contain  the  order  of  hundreds  of  neurons. 
Simulating several hundred neurons is roughly the limit of
present day technology, but this will change. A discussion
of  future  prospects  in  this  regard  will  follow  later. 

This  section  is  concerned  principally  with  the 
description  of  the  simulation  of  just  such  a  nanobrain.  The 
point  of  the  exercise  is  to  show  that  this  kind  of  thing  can 
be done, and if it can be done successfully with a mere
dozen  or  so  GenNets  as  building  blocks,  it  will  later  be 
possible  to  design  artificial  nervous  systems  with  GenNets 
numbering  in  the  hundreds,  thousands  and  up.  One  can 
imagine  in  the  near  future,  whole  teams  of  human  Genetic 
Programmers  devoted  to  the  task  of  building  quite 
sophisticated  nano  (micro?)  brains,  capable  of  an  elaborate 
behaviour  range. 

To  show  that  this  is  no  pipe  dream,  a  concrete 
proposal  will  now  be  presented  showing  how  GenNets 
can  be  combined  to  form  a  functioning  (simulated) 
artificial  nervous  system.  The  implementation  of  this 
proposal  has  not  yet  been  completed  (at  the  time  of 
writing),  but  progress  so  far  inspires  confidence  that  the 
project  will  be  completed  successfully.  The  "vehicle" 
chosen  to  illustrate  this  endeavour  is  shown  in  FIG.  3. 


FIG.  3 

This  lizard-like  creature  (called  LIZZY)  consists  of  a 
rectangular  wire-frame  body,  four  two-part  legs  and  a  fixed 
antenna  in  the  form  of  a  V.  LIZZY  is  capable  of  reacting 
to three kinds of creature in its environment, namely
mates, predators and prey. These three categories are
represented by appropriate symbols on the simulator
screen. Each category emits a sinusoidal signal of a
characteristic frequency. The amplitudes of all these
signals decrease inversely as a function of distance. Prey
emit a high frequency, mates a middle frequency, and
predators a low frequency.
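
A simple model of these signals is sketched below. It is only an illustration: the frequency values are hypothetical placeholders that respect the stated ordering (prey high, mates middle, predators low), and "decrease inversely" is read here as an amplitude proportional to 1/distance, which is an assumption.

```python
import math

# Hypothetical frequencies, chosen only to respect the ordering in the text.
FREQUENCIES = {"prey": 8.0, "mate": 4.0, "predator": 1.0}


def signal_strength(distance, source_amplitude=1.0):
    """Amplitude at the antenna, assumed proportional to 1 / distance."""
    return source_amplitude / distance


def received_signal(kind, distance, t, source_amplitude=1.0):
    """Sinusoidal signal received at time t from a creature of the given kind."""
    amplitude = signal_strength(distance, source_amplitude)
    return amplitude * math.sin(2.0 * math.pi * FREQUENCIES[kind] * t)
```

An attention threshold would then be a comparison of `signal_strength` against a fixed value.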

The  antenna  picks  up  the  signal  continuously.  Once 
the signal strength becomes large enough (a value called
the "attention" threshold), LIZZY detects the frequency of
this signal, and depending upon the outcome, executes an
appropriate sequence of actions. If the object is a prey,
LIZZY rotates toward it, moves in the direction of the
object until the signal strength is a maximum, stops,
aligns its legs, pecks at the prey like a hen (by pushing
the front of its body up and down with its front legs), and
after a while, moves away randomly. If the object is a
predator, LIZZY rotates away from it, and flees until the
signal strength is below the attention threshold. If the
object is a mate, LIZZY rotates toward it, moves in the
direction of the object until the signal strength is a
maximum, stops, aligns its legs, mates (by pushing the
back of its body up and down with its back legs), and after
a while, moves away randomly.

The above is merely a sketch of LIZZY's behavioural
repertoire.  In  order  to  allow  LIZZY  to  execute  these 
behaviours,  a  detailed  circuit  of  GenNets  and  their 
connections  needs  to  be  designed.  FIG.  4  shows  an  initial 
attempt  at  designing  such  a  circuit.  The  black  dots  indicate 
a  blocker  function,  i.e.  if  the  signal  is  active  (non  zero), 
then  the  traversing  signal  is  blocked.  There  are  7  different 
motion  GenNets,  of  which  "random  move"  is  the  default 
option.  When  any  of  the  other  6  motions  is  switched  on, 
the  random  move  is  switched  off  by  a  "blocker".  LIZZY 
moves in the following way. Each motion GenNet
consists of 8 output neurons, whose output values (as in
the WALKER GenNet) are interpreted as angular
accelerations (rather than as angles, which would pose
continuity problems when switching from one motion
GenNet to another). The position of the upper (and lower)
part of each leg is defined by the angle that the leg line
takes on a cone whose axis of symmetry is equidistant
from each of the XYZ axes defined by the body frame. The
leg  line  is  confined  to  rotate  around  the  surface  of  this 
cone.  This  approach  was  chosen  so  as  to  limit  the  number 
of  degrees  of  freedom.  If  each  leg  part  had  two  instead  of 
one  degree  of  freedom,  and  one  wanted  to  keep  the  GenNet 
fully  connected,  with  16  outputs  and  32  inputs  (angles, 
and  angular  velocities),  i.e.  48  neurons  per  GenNet,  hence 
48  squared  connections,  the  resulting  chromosome  would 
have  been  huge. 
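
To put numbers on this, the chromosome length formula N*N*(P + 1) given earlier can be evaluated for the cases mentioned. The sketch assumes P = 6 binary places, as in the earlier example. The 24 neuron figure for the one degree of freedom design (8 outputs plus 16 inputs for 8 angles and 8 angular velocities) is inferred from the text rather than stated in it.

```python
def chromosome_bits(n_neurons, places):
    """Length in bits of a fully connected GenNet chromosome: N*N*(P+1)."""
    return n_neurons * n_neurons * (places + 1)


walker_bits = chromosome_bits(12, 6)       # WALKER GenNet: 12 neurons
one_dof_bits = chromosome_bits(24, 6)      # inferred: 8 outputs + 16 inputs
two_dof_bits = chromosome_bits(48, 6)      # 16 outputs + 32 inputs
```

Under these assumptions the two degree of freedom design needs four times as many bits as the one degree of freedom design, since the number of connections grows with the square of the number of neurons.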

LIZZY'S  motion  is  determined  as  follows.  One 
calculates  the  positions  of  the  legs  relative  to  the  body 
frame.  LIZZY  as  a  whole  is  raised  (or  lowered)  vertically 
so that the lowest "foot" touches the "floor" at a point
called  the  "base  point".  LIZZY  is  then  rotated  about  the 
base  point  in  the  vertical  plane  cutting  the  base  point  and 
the  next  lowest  "foot"  until  this  second  foot  touches  the 
floor  at  the  second  base  point.  LIZZY  is  then  rotated  about 
the  line  joining  the  two  base  points  until  the  third  lowest 
foot  touches  the  floor.  The  fourth  foot  will  normally  be 
raised  relative  to  the  plane  defined  by  the  first  three 
feet. When one motion GenNet is switched off and another
switched on, the current angles and angular velocities are
input to the new GenNet, to ensure continuity of motion.

The  power  and  the  fun  of  GenNets  is  that  one  can 
evolve  a  motion  without  specifying  in  detail,  how  the 
motion  is  to  be  performed.  For  example,  the  fitness  (or 
quality)  measure  for  the  "move  forward"  GenNet  will  be  a 
simple  function  of  the  total  distance  covered  in  a  given 
number of cycles, (modified by a measure of its total
rotation and its drift away from the desired straight ahead
direction). Similar fitness measures can be defined for the
two  rotation  GenNets.  (In  fact  only  one  needs  to  be 
evolved,  and  one  can  then  simply  switch  body  side 
connections,  for  reasons  of  symmetry). 

The "align legs" GenNet is used to position the legs ready for
either pecking or mating. Since rotations and move
forward are automatically switched on and off by the input
from the antenna, there is no need for timeouts, as is the
case for the align legs, peck prey and mate GenNets. It is not too
difficult to design fitness measures for each of these
GenNets, and since they are more or less single function
GenNets (see below), their evolution should not present
problems,  especially  after  the  experience  gained  from 
implementing  the  WALKER  GenNet. 

Understanding  the  GenNet  circuit  shown  in  FIG.  4  is 
fairly  straightforward,  except  perhaps  for  the  mechanism  of 
rotating  toward  or  away  from  the  object.  The  antenna  is 
fixed on the body frame of LIZZY. It is assumed in this
simulation  that  the  object  is  always  placed  initially 
(approximately)  in  front  of  LIZZY'S  body  (rather  than 
behind  it).  Hence  if  the  signal  strength  on  the  left  antenna 
is  larger  than  that  on  the  right,  and  if  the  object  is  a  prey, 
then  LIZZY  is  to  turn  towards  it  by  rotating  clockwise,  or 
away  from  it  by  rotating  anticlockwise  if  the  object  is  a 
predator.  Eventually,  the  two  signal  strengths  will  become 
approximately equal, because the two antennae will be
equidistant  from  the  object.  When  this  happens,  LIZZY 
moves  forward.  Admittedly,  since  FIG.4  is  only  a  plan  for 
future  research,  there  will  probably  be  conceptual  and 
logical errors in the design. But I am confident that it will
be  realizable.  When  it  does  work,  a  video  will  be  made  of 
its  antics.  It  will  be  interesting  to  see  just  how  life  like  it 
will  appear  to  be. 
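
The rotation rule described above can be captured in a small decision function. This is a sketch of the rule as stated, not of the GenNet circuit itself: the function name, the `tolerance` used to decide when the two strengths are "approximately equal", and the mirrored case (right antenna stronger), which is filled in by symmetry, are all assumptions.

```python
def steering_action(kind, left_strength, right_strength, tolerance=0.05):
    """Choose a motion from the signal strengths on the two antenna arms.

    kind is "prey", "mate" or "predator". Prey and mates are turned
    toward, predators are turned away from. The rotation senses follow
    the text for the case where the left signal is stronger.
    """
    if abs(left_strength - right_strength) <= tolerance:
        return "move forward"
    left_stronger = left_strength > right_strength
    turn_toward = kind != "predator"
    # Text: left stronger and turning toward => clockwise;
    #       left stronger and turning away   => anticlockwise.
    clockwise = left_stronger == turn_toward
    return "rotate clockwise" if clockwise else "rotate anticlockwise"
```

In the circuit of FIG. 4 the corresponding choice would be made by switching the rotation and move forward GenNets on and off.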

Implementing  a  score  or  so  of  different  GenNets, 
(assuming  a  generation  time  of  the  order  of  a  minute  or 
less  for  several  hundred  generations  per  GenNet)  makes 
one  conscious  of  the  need  for  Genetic  Programming  tools. 
Initially  this  could  take  the  form  of  a  software  package 
which  could  ask  the  user  for  the  values  of  the  parameters 
needed  for  the  GenNet  to  be  evolved  (e.g.  the  number  of 
neurons,  the  input/output  neurons,  the  fitness  measure, 
the  number  of  binary  places  for  the  weights  etc)  and  the 
parameters  for  the  Genetic  Algorithm  (e.g.  population 
size,  mutation  probability,  scaling  factor  etc).  However,  in 
the  age  of  VLSI,  one  can  readily  imagine  VLSI  chips 
being  designed  which  perform  the  same  task  as  the 
software  package,  but  much  faster.  I  call  such  a  tool,  a 
Darwin  Machine.  With  suitable  Darwin  Machines,  one  can 
imagine  teams  of  Genetic  Programmers  undertaking  large 
scale projects to build much more sophisticated nervous
systems with many thousands of GenNets. One can
imagine GenNet companies being established which
supply Genetic Programmers with catalogues containing
detailed descriptions of their companies' GenNets. In time,
whole  GenNet  subsystems  could  be  sold  as  units. 

The  empirical  "engineering"  approach  to  brain  building 
ought  to  attract  the  attention  of  the  neuro  biologists.  As 
GenNets become more biologically realistic, the solutions
found by Genetic Programmers to concrete problems of
nervous  system  design  may  inspire  the  neuro  biologists  to 
perform  new  experiments.  We  may  see  a  much  closer 
collaboration  between  theoretical  and  experimental  neuro 
biology  in  the  near  future. 

Despite the attraction of the GenNet with its incredible
ability to evolve almost any desired time (in)dependent
output,  there  are  some  inherent  limitations  in  the  Genetic 
Programming  approach.  These  limitations  are  of  two 
types,  intra  GenNet  and  inter  GenNet.  Intra  GenNet 
limitations  concern  what  an  individual  GenNet  cannot  do. 
There  is  now  a  crying  need  for  a  new  branch  of  theoretical 
(mathematical)  neuro  dynamics  to  consider  what 
behaviours  GenNets  can  and  cannot  evolve.  At  the 
moment,  I  am  working  blind,  and  am  guided  more  by  an 
empirical  "Fingerspitzengefuhl"  than  any  theoretical 
guidelines. Genetic Programming at the present time is
very  much  an  art. 

After  having  played  with  GenNets  for  nearly  a  year 
now,  what  have  I  learned  about  intra  GenNet  limitations, 
or  at  least  GenNet  evolutionary  tendencies?  These  lessons 
can be summarized as follows:

a) Use shaping. Don't try to evolve the whole behaviour
in one step - break it up (e.g. if you try to evolve a full
sinusoid cycle in one go, you will probably get a straight
line as a result. Instead, evolve a half sinusoid, and take
the  resulting  GenNet  as  a  starter  for  the  full  sinusoid). 

b)  When  shaping,  make  sure  your  output  curve  has  non 
zero  slope  during  the  last  few  cycles.  GenNets  tend  to  find 
stable  (i.e.  constant)  outputs  fairly  easily,  which  is  fine  if 
that  is  what  you  want,  but  can  block  further  evolution,  if 
the  output  needs  to  change  during  later  cycles. 

c)  Avoid  multi-function  GenNets.  I  was  curious  to  see  if 
it  was  possible  to  evolve  a  GenNet  which  performs  two 
functions, e.g. to get the stick legs to move either left or
right  depending  upon  a  change  of  input  signal  on  one 
input  neuron.  Yes,  it  is  possible.  I  got  it  to  move  both 
ways.  But  when  I  tried  to  get  two  separate  full  cycle 
sinusoids,  I  failed.  I  got  two  half  sinusoids  (of  different 
frequencies) without any problem, but I could not get two
adequate  full  sinusoids.  I  lost  two  weeks  work  trying  to  do 
that,  and  eventually  gave  up. 

d) Keep the number of evolutionary cycles small. It is
harder to evolve a desired output over a large number of
cycles than over a small number. A large number increases
the generation time too, of course.

e) Weight your fitness measure. If you want to "push"
your  evolving  output  curve  in  a  given  direction,  weight 
the  errors  of  the  later  cycles  more  heavily. 

f) Chaos seems to be no problem. In the early
generations,  I  often  got  what  looked  like  chaotic  output 
(not  surprising,  considering  that  GenNets  are  highly  non 
linear,  time  dependent  systems),  but  since  these  solutions 
had  very  low  fitness  scores,  they  were  quickly  eliminated. 
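
Lesson e) can be illustrated with a weighted version of an inverse squared error fitness. This is a sketch under stated assumptions: the geometric weighting `growth ** k`, the default value of `growth`, and the `epsilon` guard are hypothetical choices, since the text only says that the errors of later cycles should be weighted more heavily.

```python
def weighted_fitness(desired, actual, growth=1.05, epsilon=1e-12):
    """Inverse of a weighted sum of squared errors over the cycles.

    The error at cycle k is multiplied by growth ** k, so with
    growth > 1 later cycles count more heavily (assumed weighting).
    """
    error = sum(
        (growth ** k) * (d - a) ** 2
        for k, (d, a) in enumerate(zip(desired, actual))
    )
    return 1.0 / (error + epsilon)
```

With `growth` above 1, an output curve that deviates late in the run scores lower than one that deviates by the same amount early on.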

Inter GenNet limitations are even more of a challenge.
Future technologies such as (Wafer Scale Integration
(WSI) [RUDNICK et al 1989], Molecular Electronics
[REED 1988], Nano-technology [DREXLER 1986],
[LANGTON 1989], [SCHNEIKER 1989]) and even
quantum  computing,  will  give  humanity  the  means  to 
build  machines  with  [m,  b,  tr,  ...]illions  of  GenNets 
within  a  human  generation.  How  on  earth  are  we  going  to 
know  how  to  put  them  all  together?  Once  I  have 
finished  getting  LIZZY  to  work,  I  would  like  to  try  three 
things.  The  first  is  to  give  GenNets  the  capacity  to  learn. 
LIZZY cannot adapt its behaviour. Its reactions are merely
"instinctual".  The  second  thing  is  to  attempt  to 
Genetically  Program  robots.  There  is  a  limit  to  what  one 
can  do  with  simulations  and  it  seems  to  me  that 
Genetically  Programming  robots  is  a  goal  which  should 
be  aimed  at.  The  third  thing  is  to  attempt  to  get  the 
Genetic  Algorithm  itself,  rather  than  a  human  Genetic 
Programmer,  to  connect  the  GenNets.  At  this  point,  the 
Darwin  Machine  could  really  come  into  its  own.  One  can 
imagine  populations  of  simulated  "creatures"  competing 
in  an  environment  consisting  of  other  creatures,  all  of 
which are "constructed" with the GA. The fitness measure
would  be  truly  Darwinian.  Real  time  Darwin  machines 
could be put into robots and a sequence of experiments
undertaken  on  the  same  machine.  The  GA  population 
would  be  derived  from  the  stored  results  of  each 
experiment  in  the  sequence.  With  nanotechnology,  zillions 
of  nanorobots  ("nanots"),  could  function  in  parallel  in  a 
molecular environment, and report back to some central
molecular  processor,  which  determines  the  next 
generation.  Nanotechnology  will  probably  be  a  reality 
before  the  end  of  my  research  career,  and  considering  the 
staggering  potential  complexity  of  nanobased  devices,  I 
believe  that  an  evolutionary  approach  to  its  foundation 
will  be  inevitable.  The  complexity  will  be  such,  that  a 
system  will  have  to  be  treated  as  a  black  box,  whose 
performance  only  is  of  concern.  This  approach  lies  at  the 
heart  of  Genetic  Programming.  I  neither  know  nor  care 
about  what  takes  place  inside  a  GenNet.  I  merely  mutate 
and  select,  mutate  and  select.  In  a  future  of  overpoweringly 
complex  systems,  there  may  be  no  other  way.  It  is  the 
approach that nature uses. Maybe we should too.

Abstract

Genetic  Algorithms  are  generally  compute 
intensive  procedures  that  require  the  evalua¬ 
tion  of  many  candidate  solutions  to  a  given 
problem.  In  the  application  area  we  study 
(routing  and  scheduling),  the  genetic  algo¬ 
rithm  sets  parameters  for  a  mathematical 
heuristic.  To  reduce  the  computational  over¬ 
head  of  this  approach,  we  developed  three 
mechanisms  for  improving  the  performance 
of  the  genetic  search.  First,  we  employ  a 
method  of  using  multiple  sharing  evaluation 
functions,  permitting  the  parallel  investiga¬ 
tion  of  multiple  peaks  in  the  search  space. 
Second,  we  carry  out  function  evaluation  in 
parallel,  using  a  network  of  heterogeneous 
processors.  Third,  a  neural  network  system  is 
employed  to  inject  heuristic  knowledge  into 
the  initial  population  of  the  genetic  algo¬ 
rithm,  resulting  in  relatively  fast  conver¬ 
gence.  When  the  methods  are  used  together, 
the result is high quality solutions with considerable
speedup in computational time.

Our  overall  system,  called  XVl'P-PGA,  is  a 
distributed software system that demonstrates
the increased efficiency of genetic algorithms
in  the  automated  discovery  of  parameters  for 
mathematical  heuristics,  in  the  domain  of 
computer-aided  vehicle  routing  and  schedul¬ 
ing  problems. 

1  Introduction 

In the past few years many researchers have investigated
ways of improving the performance of Genetic Algorithms
(GA) (Holland, 1975). GA’s have been shown to be
useful in many optimization problems. GA’s are generally
compute intensive. Methods of improving the performance
and efficiency of GA’s are of significant importance.


The  primary  parameters  of  a  standard  GA  are  popula¬ 
tion  size,  crossover  and  mutation  rates,  and  number  of 
crossover  points.  These  parameters  have  a  significant 
impact  on  performance  (Schaffer,  1989;  Goldberg,  1989; 
Jog,  1989).  Adaptive  selection  methods  (Baker,  1985) 
and reproductive evaluation techniques (Whitley, 1987)
have also been shown to speed up GA searches. Parallel
implementations of a GA have demonstrated considerable
speedup (Grefenstette, 1981).

The  Vehicle  Routing  Problem  (VRP)  is  a  highly  com¬ 
binatorial  problem  (NP-Hard)  that  has  been  extensively 
studied.  In  the  VRP,  there  is  a  known  collection  of  stop 
points  that  have  demands  for  service,  and  a  fixed  fleet  of 
limited  capacity  vehicles  to  serve  the  stops.  The  problem 
is  to  find  the  minimum  distance  way  to  assign  the  stops  to 
vehicles and specify the order in which each of the vehicles
visits its stops. All the vehicles begin and end their
tours at a fixed depot location. Researchers have
developed many heuristic strategies for these problems
(Bodin et al., 1983). A major shortcoming of these
heuristics is the inability to deduce the kind of strategy to
adopt,  given  the  characteristics  of  the  problem  at  hand. 
The  need  to  discover  the  parameters  to  use  in  a  heuristic 
for  a  given  problem  instance,  led  to  the  identification  and 
use  of  GA  techniques. 

This  paper  describes  ways  to  use  GA’s  more  effec¬ 
tively  to  discover  parameters  for  mathematical  heuristics, 
in  the  domain  of  computer-aided  vehicle  routing  and 
scheduling  problems.  A  method  of  using  multiple  sharing 
evaluation  functions  is  described,  permitting  the  parallel 
investigation of multiple peaks in the search space.
Limitations of single-processor systems and the increasing
availability of networked heterogeneous workstations led
to investigations of utilizing the abundance of unused
CPU cycles in the network in order to overcome the long
search time of the GA’s. In most GA’s the initial
population consists of entirely random structures. In routing
problems, it is possible to formulate reasonably good
structures for the initial population using a trained neural
network system.

2 Functional Description of the Multiple
Sharing Evaluation Functions

In the parameter discovery task we seek a set of
parameters under the guidance of an evaluation function.
GA’s use bit strings as chromosomal encodings of parameters
of the problem they are trying to solve. The control
parameters in our case are seed-point locations. We
model each vehicle tour with a location called a seed,
with the vehicle conceptually traveling from the depot to
the seed and back. The seed-points represent an average
location around which the truck serves. We use binary
bit strings to encode the seed-points. The number of
seed-points recommended by the GA is equal to the number of
vehicles available. The example string shown below
represents one seed-point with an x and a y coordinate,
shown in both binary and decimal.

String:         0000101100  1010110010

Parameter:      Sx1         Sy1

Decoded Value:  55          803

Figure 1: Unified Architecture of the
overall system. The same control parameters
are supplied to the three different clustering
heuristics (FAA1, FAA2 and FGAA).
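
The example string in Section 2 can be checked with a short decoding sketch. Read as plain binary, the two 10-bit fields do not give 55 and 803, but read as reflected Gray code they do. The text does not state which encoding is used, so Gray coding is an assumption here, and the function names are illustrative only.

```python
def gray_to_int(bits: str) -> int:
    """Decode a reflected-Gray-code bit string (most significant bit first)."""
    value = 0
    bit = 0
    for ch in bits:
        bit ^= int(ch)              # each binary bit is the XOR of the Gray bits so far
        value = (value << 1) | bit
    return value


def decode_seed_points(chromosome: str, field_width: int = 10):
    """Split a chromosome into (x, y) seed-points, one field per coordinate."""
    fields = [chromosome[i:i + field_width]
              for i in range(0, len(chromosome), field_width)]
    coords = [gray_to_int(f) for f in fields]
    return list(zip(coords[0::2], coords[1::2]))


print(decode_seed_points("0000101100" + "1010110010"))  # [(55, 803)]
```

A chromosome for a four-vehicle problem would, under the same assumption, simply be eight such 10-bit fields concatenated.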


2.1 The Evaluation Functions

We use a "cluster first, route second" heuristic technique
as a VRP solver in the evaluation function. There
are two fundamental decisions that have to be made before
using this heuristic. First, we need to determine the
clusters, or groups of stops served by the same vehicle.
Second, we need to efficiently sequence the stop locations
of each vehicle. We experimented with four
methods of determining the clusters. The first two
methods, Fast Assignment Approaches FAA1 and FAA2,
are new algorithms that are relatively fast and well suited
for the genetic search. The third method, FGAA, is a
modified version of the generalized assignment method
developed by Fisher and Jaikumar (1981). The fourth
method is a combination method, which uses the same
GA-recommended seed points for the three methods
mentioned above. Each of the methods is described below.

Clustering Method 1 [FAA1]: In this method, one
seed-point is active at a time. The nearest stop is assigned
to the active seed-point, if doing so does not
violate the corresponding capacity constraint. For each
stop assigned, a weighted distance factor is added to the
active seed-point. The seed-point with the minimum
weighted distance is made active for the next assignment.
This process continues until all the stop points are
assigned to some seed-point.
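
The FAA1 description above can be sketched as follows. This is only one reading of the text: the form of the "weighted distance factor" (here `weight` times the stop-to-seed distance), the choice of the nearest stop that still fits, and the handling of stops that fit nowhere are all assumptions, as are the function and parameter names.

```python
import math


def faa1(stops, demands, seeds, capacity, weight=1.0):
    """Assign stops to seed-points, one active seed-point at a time.

    Returns (clusters, leftovers): clusters[k] lists the stop indices given
    to seed k; leftovers are stops that fit no vehicle.
    """
    load = [0.0] * len(seeds)
    score = [0.0] * len(seeds)           # accumulated weighted distance per seed
    clusters = [[] for _ in seeds]
    unassigned = set(range(len(stops)))
    closed = set()                       # seeds that can take none of the remaining stops

    while unassigned and len(closed) < len(seeds):
        # The seed with the smallest accumulated weighted distance becomes active.
        active = min((k for k in range(len(seeds)) if k not in closed),
                     key=lambda k: (score[k], k))
        feasible = [s for s in unassigned
                    if load[active] + demands[s] <= capacity]
        if not feasible:
            closed.add(active)
            continue
        nearest = min(feasible,
                      key=lambda s: (math.dist(stops[s], seeds[active]), s))
        clusters[active].append(nearest)
        load[active] += demands[nearest]
        score[active] += weight * math.dist(stops[nearest], seeds[active])
        unassigned.remove(nearest)
    return clusters, sorted(unassigned)
```

Because the active seed is always the one with the smallest accumulated score, seeds that have only picked up nearby stops keep receiving assignments, which tends to balance the spread of the clusters.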

Clustering Method 2 [FAA2]: The second method,
FAA2, uses a simple heuristic to do the clustering. Here
all the seed points are actively in the contest for receiving
the next stop point assignment. The stop point with the
minimum distance to any of the seed points is selected and
assigned to that seed point.
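
A minimal sketch of FAA2 as described above: at every step the closest (stop, seed-point) pair is chosen. The text does not say how vehicle capacity is handled in FAA2, so the capacity check below is an assumption, and all names are illustrative.

```python
import math


def faa2(stops, demands, seeds, capacity):
    """Repeatedly assign the globally closest (stop, seed) pair that still fits."""
    load = [0.0] * len(seeds)
    clusters = [[] for _ in seeds]
    unassigned = set(range(len(stops)))

    while unassigned:
        # Every seed competes for every remaining stop it still has room for.
        pairs = [(math.dist(stops[s], seeds[k]), s, k)
                 for s in unassigned
                 for k in range(len(seeds))
                 if load[k] + demands[s] <= capacity]
        if not pairs:
            break                        # remaining stops fit no vehicle
        _, s, k = min(pairs)
        clusters[k].append(s)
        load[k] += demands[s]
        unassigned.remove(s)
    return clusters, sorted(unassigned)
```

This version rescans all pairs on each step, which is quadratic in the number of stops; a priority queue would be the natural refinement if the evaluation had to be faster.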


[Figure 2: grid with axes from 0 to 1023, shown with the 25-bit string 0000100001000000001000010]

Figure 2: The 1023 X 1023 grid is partitioned
into 25 sectors for encoding the neural
network system. The seed-points are shown
in the shaded regions.

Clustering  Method  3  [FGAA]:  A  generalized  assign¬ 
ment  problem  is  solved  to  assign  the  stops  to  the  seed- 
points  (Fisher  and  Jaikumar,  1981). 

Clustering Method 4 [COMBO method]: Many optimization
problems require the investigation of multiple local
optima. Here the concept of sharing functions (Goldberg,
1987) is used to investigate the formation of stable
subpopulations of different strings in the GA, thereby
permitting the parallel search of many peaks. This
method uses the string recommended by the GA on all
three methods (FAA1, FAA2, FGAA), and the function
with the best performance value is selected to return the
fitness value for that particular string to the GA, as shown
in Figure 1. The three methods (FAA1, FAA2, FGAA)
all use the same string and, due to the competition
between widely disparate points in the search space, help
maintain a diverse population which searches many
peaks in parallel.
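
The COMBO evaluation described above amounts to running every clustering method on the same seed-points and reporting the best result. A minimal sketch, assuming each evaluator is a callable that returns a total distance (smaller is better); the evaluator bodies and the numbers they return are made-up stand-ins, not results from the experiments.

```python
def combo_fitness(seed_points, evaluators):
    """Return (best total distance, name of the method that produced it)."""
    distances = {name: evaluate(seed_points)
                 for name, evaluate in evaluators.items()}
    best_method = min(distances, key=distances.get)
    return distances[best_method], best_method


# Stand-in evaluators with invented values; real ones would cluster the
# stops, route each cluster, and sum the tour lengths.
evaluators = {
    "FAA1": lambda seeds: 4120.0,
    "FAA2": lambda seeds: 3985.5,
    "FGAA": lambda seeds: 4050.0,
}
print(combo_fitness([(55, 803)], evaluators))  # (3985.5, 'FAA2')
```

Returning the winning method's name alongside the fitness makes it possible to log which method produced each new best solution, as Table 2 does.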

2.2  The  Mechanics  of  the  Adaptive  Search 

The complex process of multiple vehicle routing
optimization is working in an environment E. There is a set
of control parameters C available for the adaptive search
strategy. Within each environment there is a fitness
measure for the performance of the VRP process under the
control parameters. Each environment e in E to which
the controlled VRP process is subjected defines a
performance response surface over the control parameter space
C, defined by a fitness function F_e. It is the response
surface defined by F_e that is explored by the adaptive search
strategy in order to generate good performance of the VRP
process. In our problems, the function F_e is extremely
complex, high-dimensional, multimodal and discontinuous.


Figure 3: The figure illustrates the activity
in the last 50 trials, concentrated around a
few good points in the search space. Each
symbol defines a

The GA exploits the accumulating knowledge of the
VRP process being controlled. Each point in the control
parameter space is represented as a genetic string. This
string is represented in binary. In XVRP-GA, the control
parameter is the location of the seed-point in 2-dimensional
space. Each string has a field allocated for the
performance function F_e, which is returned by the
evaluation function. The GA maintains a population of these
control parameter strings. Each individual is submitted
for evaluation as a control parameter for the VRP process,
and the associated performance measure is saved. Finally,
using selection probabilities, these control parameter
strings undergo reproduction with the crossover and mutation
genetic operators.

The population is a collection of candidate control
parameters C. Fixing one of the control parameters and
leaving the other parameters free defines a hyperplane.
Each parameter has (1023 X 1023) possible locations. In
our case, there is one control parameter for each available
vehicle. Over all possible hyperplanes, the search space
is complex and multimodal. We observe that the GA rapidly
exploits accumulating information about F_e to restrict
sampling to those hyperplanes which have a high
expectation of good performance.

The search space defined by F_e is multimodal. Due to
the competition between widely disparate points in the
search space, using multiple sharing evaluation functions
helps maintain a diverse population, which could prevent
premature convergence to a local optimum. FAA1,
FAA2 and FGAA are all controlled by the same set of
control parameters C, available from the adaptive GA.
As the adaptive search progresses, each of the methods is
likely to be sampling different hyperplanes looking for
peaks in parallel, and occasionally uses the control
parameters discovered by the other functions to start searching
a hyperplane that is providing good performance.

Table 1 illustrates the mechanics of a crossover
genetic operator. Figure 2 shows the seed-points produced
by the COMBO method. The seed-point string
shown in Table 1 is the value that the COMBO method
uses. The encoded version of the string is used by the
GA. To illustrate the effect of crossover, let us assume we
are using the standard 2-point crossover method. C11 and
C12 are the crossover points in Parent 1, and C21 and
C22 are the crossover points in Parent 2. Now, if the
genetic material between C11 and C12 and between C21 and
C22 is exchanged, two offspring strings are generated. The use
of the crossover has resulted in two candidate seed-point
locations, with seed-point 3 not being affected by the
crossover operation, while seed-points 1 and 2 are moved to
new locations. These new locations of the seed-points
result in different clusters, which in turn result in a
different fitness measure (total distance). We then sequence
the stop locations in each of the candidate clusters in a
cost-effective way (TSP) and return the result to the GA.
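
The exchange described above can be sketched as a standard two-point crossover on bit strings. The sketch assumes both parents are cut at the same two positions, which is the usual form of the operator; the function name and the example strings are illustrative only.

```python
def two_point_crossover(parent1: str, parent2: str, c1: int, c2: int):
    """Swap the segment [c1, c2) between two equal-length bit strings."""
    if not (0 <= c1 <= c2 <= len(parent1) == len(parent2)):
        raise ValueError("cut points must lie inside two equal-length parents")
    child1 = parent1[:c1] + parent2[c1:c2] + parent1[c2:]
    child2 = parent2[:c1] + parent1[c1:c2] + parent2[c2:]
    return child1, child2


# Three seed-points of 20 bits each: cutting inside the first 40 bits can move
# seed-points 1 and 2 but leaves seed-point 3 (bits 40-59) untouched.
p1, p2 = "0" * 60, "1" * 60
c1, c2 = two_point_crossover(p1, p2, 5, 35)
print(c1[40:] == p1[40:], c2[40:] == p2[40:])  # True True
```

Which seed-points move depends only on where the cut points fall relative to the field boundaries, which is why one seed-point can survive a crossover unchanged.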

The first few generations start with seed-point location
values uniformly distributed over the search space. As
generations evolve, seed-points tend to be concentrated
in tight geographical areas due to the survival-of-the-fittest
mechanism of the GA. This is illustrated in Figure
3, where only the last 50 trials are plotted. A trial is a single
execution of the evaluation function on a candidate
control parameter, and a generation consists of many trials.
The three performance curves on the graph shown in
Figure 4 illustrate the survival-of-the-fittest nature of the
GA search. The performance measure is total distance
traveled by the fleet, so small values are desirable. The
top curve indicates the worst performance of the evaluation
function as a function of generations. The bottom curve in
Figure 4 indicates the best performance in each generation.
The middle curve is the plot of the average performance
of the evaluation function. The decreasing trend
in the curves illustrates the survival of the fittest candidates
in the population, and indicates that the GA is doing
much better than a random walk in the control parameter
search space.

Table 1: The table illustrates the mechanics of a crossover genetic operator. The seed
string shown in the table is the actual value that the COMBO method uses. The encoded
version of the seed string is used by the GA in the COMBO method. C11 and C12 are the
crossover points in Parent 1 and C21 and C22 are the crossover points in Parent 2.

The Mechanics of the Crossover Operator

Table 2 presents empirical work that illustrates the
parallel nature of the search on a four-vehicle problem.
The values shown in Table 2 are generated when each
evaluation function (FAA1, FAA2, FGAA) finds a seed-point
parameter which produces a performance value
better than the best found up to that point in time. The
FAA1 method produces the first best solution (example
1). FAA2 then identifies a sequence of seven improving
solutions, as shown in examples 2 through 8. The seed-points
that produce these solutions are the result of
searches centered around a few "good" spots. At generation
14, the GA produces a seed-point location that FAA1
adopts and improves. The coordinate values reveal that this
seed-point location is essentially the one FAA2 was using
to improve the solution performance, as shown in example
9. The FGAA method, which had not generated a best
solution in earlier generations, produces one at generation
17 using seed-points from generation 2. Also note
that the seed-points in example 11 are in a completely
different area of the search space. This illustrates the
parallel search of a multimodal response surface occurring
in the algorithm.


[Figure 4 plot; x-axis: Number of Generations]

Figure 4: The three graphs illustrate the
performance of the GA search in a particular
generation. The middle curve indicates average
performance of the evaluation.

Table 2: Parallel nature of the adaptive search. Each of the three methods is able to
exploit promising seed-point locales discovered by the other methods.

3 Functional Description of the
Parallel/Distributed Schema

XVRP-PGA  uses  the  genetic  algorithm’s  population 
structure  to  parallelize  the  evaluation  process.  XVRP- 
PGA  is  an  asynchronous  distributed  genetic  algorithm 
that  is  designed  to  use  the  idle  CPU  cycles  on  a  hetero¬ 
geneous  network  of  workstations.  Each  workstation 
varies  in  speed,  CPU  architecture  and  operating  system. 

Distributed programming supports the creation of
multiple processes with disjoint address spaces, and
provides a means of communication among them. XVRP-PGA
uses a star topology in which there is a special node
called the root. Each processor is connected to the root
processor, which administers the network traffic. As Figure
5 indicates, this star topology does not create a bottleneck
at the root processor. This is because the root processor
is only doing the reproduction and selection process
of the GA and the dispatching and polling of
the results, while the remote processors are doing the
compute-intensive evaluation.

Remarkable speedup is achieved even with a small
number of processors working in parallel because of the
long computational time of the evaluation functions. At
any given time only a fraction of the total CPU cycles is
used in a heterogeneous network of many workstations.
XVRP-PGA detects these wasted CPU cycles
and uses them in parallelizing the GA evaluation of the
population. We use the Remote Procedure Call (RPC)
paradigm, in which a client communicates with a server.
In this process, the client first calls a procedure to send a
data packet to the server. When the packet arrives, the
server calls the dispatch routine, performs the service
requested, and sends back the reply. RPC can be used to
communicate between processes on different processors
as well as for communication between different processes
on the same processor. RPC’s are blocking, meaning the
client procedure waits until the server processor completes
its task. This would defeat the purpose of the parallelism. In
XVRP-PGA, we use multiple daemons and immediate
acknowledgment to avoid the blocking problem of the
RPC.


[Figure 5 diagram: Genetic Algorithm; Machine B, Machine C, Machine D, Machine E, Machine F; Distributed Parallel Architecture]

Figure 5: The heterogeneous network of
workstations used for parallel evaluation of
population members.

When data is being shared by two or more distinct
processor types, there is a need for portable data. Integers
are represented differently on distinct architectures, and
alignment on word boundaries causes the size of a data
structure to vary from processor to processor. The eXternal
Data Representation (XDR) standard is used to solve
the data portability problem. In many parallel and distributed
algorithms the time spent on interprocessor data
communication is a sizable fraction of the total time
required to solve the problem. The communication delays
are due to the communication processing time, scheduling
time, transmission time and medium (Ethernet) propagation
time. However, in our experiments, the time
spent evaluating each solution candidate is considerably
more than the total communication delays.
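
To illustrate the portability point: XDR represents a signed integer as four bytes in big-endian order, regardless of the host's native layout. The sketch below packs seed-point coordinates that way using Python's `struct` module. It shows the representation only; it is not the XDR library used by XVRP-PGA, and the function names are hypothetical.

```python
import struct


def pack_seed_points(points):
    """Encode (x, y) pairs as consecutive 4-byte big-endian signed integers."""
    flat = [coord for point in points for coord in point]
    return struct.pack(f">{len(flat)}i", *flat)


def unpack_seed_points(data):
    """Inverse of pack_seed_points; works on any host byte order."""
    count = len(data) // 4
    flat = struct.unpack(f">{count}i", data)
    return list(zip(flat[0::2], flat[1::2]))


wire = pack_seed_points([(55, 803)])
print(wire.hex())                  # 0000003700000323
print(unpack_seed_points(wire))    # [(55, 803)]
```

Because the `>` prefix fixes both byte order and field size and disables padding, the sender and receiver agree on the layout even when their native integer formats and alignment rules differ.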


Figure  6:  The  communication  between  the 
master  processors  and  the  various  remote 
slave  processors,  utilizing  daemons. 

Figure 6 illustrates the communication between the
root processor and the various remote processors. The
dispatcher gets a set of solution candidates selected for
evaluation. This happens at every generation. The
dispatcher, as shown in Figure 6, sends the decoded solution
candidate to a remote site for evaluation. The remote
site accepts the parameter and sends an acknowledgment.
The dispatcher then sends the next candidate to the next
available remote machine. After dispatching candidate
solutions to all the available sites, the dispatcher polls
the local_daemon for results. In the meantime, a remote
site which has completed an evaluation sends the result to
the local_daemon. The dispatcher receives the result
from the local_daemon and immediately dispatches
another candidate to the site which returned the result. In
case of a remote processor malfunction, the dispatcher
retransmits the candidate parameter to a functioning
processor. The specific processors used in our experiments
are described in Section 5.
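
The dispatch loop described above can be imitated on a single machine with a thread pool standing in for the remote sites. This is a sketch of the control flow only: the pool, the retry limit `max_attempts`, and the function names are assumptions, and no RPC or daemon is involved.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def evaluate_generation(candidates, evaluate, n_sites, max_attempts=3):
    """Return {candidate index: fitness}, keeping at most n_sites evaluations in flight."""
    results = {}
    attempts = [0] * len(candidates)
    queue = list(range(len(candidates)))     # indices still waiting for a site
    in_flight = {}                           # future -> candidate index

    with ThreadPoolExecutor(max_workers=n_sites) as pool:
        def dispatch():
            i = queue.pop(0)
            attempts[i] += 1
            in_flight[pool.submit(evaluate, candidates[i])] = i

        # Initial round: one candidate per available site.
        while queue and len(in_flight) < n_sites:
            dispatch()
        # Poll for results; each returned result frees a site for the next candidate.
        while in_flight:
            done, _ = wait(list(in_flight), return_when=FIRST_COMPLETED)
            for future in done:
                i = in_flight.pop(future)
                try:
                    results[i] = future.result()
                except Exception:
                    # Treat a failed evaluation like a site malfunction: resend later.
                    if attempts[i] < max_attempts:
                        queue.append(i)
                if queue:
                    dispatch()
    return results
```

For example, `evaluate_generation([[1, 2], [3, 4]], sum, n_sites=2)` returns `{0: 3, 1: 7}`. A candidate that keeps failing is dropped after `max_attempts` tries, so the caller should check for missing indices.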


4  Functional  Description  of  the  Neural  Net¬ 
work  Module 

In this section we describe how the neural network
system accumulates and preserves knowledge gained
from past experiments. The neural network system
described in (Kadaba et al., 1990) is employed to inject
this heuristic knowledge into the initial population. The
transfer of previous experience is of significant benefit in
acquiring the task-dependent expertise necessary to solve
complex real-world problems. The neural network system
XVRP-NN acts as a pattern associator, matching
the descriptors of the incoming problems with those of the
previously solved problems. The outcome of this matching
is a promising initial population for the GA. In this
way, the search space is considerably pruned and the
computational effort is reduced (Grefenstette, 1981; Hayong,
1985). The neural networks store parameters of previously
solved problems. Figure 2 shows how the parameters
are encoded. The parameters are seed-points generated
by the GA. These seed-points are recorded in the
course of experimentation with various problems and
various environments of the GA. A potential problem in
initializing a GA population is premature convergence.
To overcome this, we seed only half the population from
the recommendation of the neural networks and the other
half of the population with random strings. Also, a
matcher first checks if the descriptors of the incoming
task are similar to the training set of the neural network.
The degree of similarity is the sum of squared errors
between the incoming descriptors and the set of training
patterns. If this measure is high, the neural network system
detects no match, and the GA population is initialized
only with random strings. However, due to the
parallel, multi-modal nature of the GA, even highly
suboptimal initial starting points in the search space
injected into the population will die out in subsequent
generations.
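
The seeding rule above can be sketched as follows. The text does not say how the sum of squared errors is aggregated over the training set or what threshold counts as "high", so the sketch assumes the error to the closest training pattern and a caller-supplied `threshold`; `recommend` is a hypothetical stand-in for the trained network.

```python
import random


def sum_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def initial_population(descriptor, training_patterns, recommend,
                       pop_size, string_len, threshold, rng):
    """Seed half the population from the network when the problem looks familiar."""
    def random_string():
        return "".join(rng.choice("01") for _ in range(string_len))

    # Assumption: similarity is judged against the closest training pattern.
    mismatch = min(sum_squared_error(descriptor, p) for p in training_patterns)
    if mismatch > threshold:
        return [random_string() for _ in range(pop_size)]    # no match: all random

    n_seeded = pop_size // 2
    seeded = [recommend(descriptor) for _ in range(n_seeded)]
    return seeded + [random_string() for _ in range(pop_size - n_seeded)]


pop = initial_population([0.0, 0.0], [[0.0, 0.1]], lambda d: "1" * 8,
                         pop_size=6, string_len=8, threshold=0.5,
                         rng=random.Random(0))
print(pop[:3])  # ['11111111', '11111111', '11111111']
```

Keeping half of the population random is what guards against premature convergence when the recommendation turns out to be poor.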

XVRP-NN basically consists of three stages, each of
which is described in more detail in this section. The first
stage consists of the Stop Data Feature Extraction
(SDFE) network. In essence, a feature extraction neural
network condenses the problem at hand to its essential
characteristics, then passes the condensed information to
a classification network that is used for training. The
teaching stimulus is identical to the input stimulus, letting
the internal layers of the network adjust to "summarize"
the input data. Such neural networks are also called
self-organizing. All of the networks were trained with
the back-propagation learning paradigm (Rumelhart and
McClelland, 1986). The first stage creates and encodes
the domain-specific data. The problem data consists of
Cartesian coordinates of the stop locations. This data is
pre-processed to make it invariant to scaling and rotation
(Fukushima & Miyake, 1982), producing a 30-unit problem
descriptor vector. Routing problems are usually unchanged
under rotations and translations of the locations
of stops to visit. However, it is well known that neural
networks cannot easily discriminate among rotated and
translated data sets. In general, the problem representation
decision is critical.


Figure  7:  XVRP-NN  basically  consists  of 
three  modular  neural  networks.  The  Stop 
Data  Feature  Extraction  (SDFE)  network,  the 
Seed  String  feature  extraction  (SSFE)  net¬ 
work,  and  the  generalization  network  GN. 

The seed-point data consists of 25 values. These
values represent the location of the seed-point for a given
instance of a problem. As shown in Figure 7, the 25
seed-point values are normalized between 0.0 and 1.0 in
the output encoding module and are used as a teaching
output in the Seed String Feature Extraction (SSFE) network.
The second stage extracts the features from the problem
descriptors and seed-point descriptors, which are then fed
to the generalization network GN as training data. The
SDFE vectors and the SSFE vectors are stored in two
separate files. These two files are then used to train the
generalizing network. Thus, there are 3 off-line trained
networks (SDFE, SSFE and GN) which are subsequently
used for on-line recall. The third stage is on-line
generalization, resulting in the seed-point location
recommendation vector.

The testing data is presented to the data encoding
module, which produces a 30-unit problem descriptor
vector. This problem descriptor is exposed to the SDFE
network, which in turn produces the 4-unit feature vector
of stop data. This feature vector of stop data is then
exposed to the trained generalization network. The
generalizing network, on recall, produces the 4-unit feature
recommendation vector, which is presented to the
reconstruction network. The original dimension of 25 units in
the performance vector is generalized by the reconstruction
network. This vector is decoded by using a high
value close to 1.0 to represent the best seed-point location
for the given instance of the test stop data. Finally, the
raw seed-point data is presented to the GA.
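
The final decoding step can be sketched as picking the sectors with the highest outputs and mapping each to a location on the grid. The row-major ordering of the 25 sectors, the use of the sector centre as the seed-point, and the function names are all assumptions; the example vector reuses the 25-bit string printed with Figure 2.

```python
def recommended_sectors(vector, n_vehicles):
    """Indices of the n_vehicles outputs closest to 1.0 (the largest values)."""
    ranked = sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)
    return sorted(ranked[:n_vehicles])


def sector_center(index, grid=1023.0, side=5):
    """Centre of a sector, assuming row-major numbering on a side x side grid."""
    cell = grid / side
    row, col = divmod(index, side)
    return ((col + 0.5) * cell, (row + 0.5) * cell)


outputs = [float(bit) for bit in "0000100001000000001000010"]
sectors = recommended_sectors(outputs, n_vehicles=4)
print(sectors)  # [4, 9, 18, 23]
seed_points = [sector_center(i) for i in sectors]
```

In practice the network outputs would be real values rather than exact zeros and ones, so taking the top `n_vehicles` entries is a simple way to obtain exactly one seed-point per vehicle.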

5  Experimental  Results 

Twenty-five problems were generated to test the system. The
problems are fully dense with 200 stop points and 4 vehicles
with a utilization factor of 95%, and were generated in a
square, 1023 miles on each side. The problems were run
on a network of SUN, HP, VAX, and NeXT workstations.
All experiments were performed using a modified
GENESIS (Grefenstette, 1984) system. The backpropagation
learning paradigm (Rumelhart, 1988) was employed
to train the neural networks.


Figure 8: Comparison of performance of
FAA1, FAA2, FGAA and COMBO methods
for a single problem. The best solution in
1000 trials is plotted.


5.1  Performance  Improvement  with  Multiple  Shar¬ 
ing  Evaluation  Functions 

In order to show the performance improvements of the
COMBO method, each of the 25 problems was run for
2000 trials using the FAA1, FAA2 and FGAA methods.
All the GA parameters were kept constant throughout
the experiments. The performance improvement of the
GA by using multiple sharing evaluation functions
(COMBO method) for a single problem is illustrated in
Figure 8. Note that the three individual methods do not
improve the search after about 1000 trials, but the
performance curve of the COMBO method continues to drop
through subsequent trials. This result is consistent over
all 25 problems tested, and is indicated in Figure 9.

5.2 Performance Improvement with the
Parallel/Distributed Schema

It is very difficult to report performance improvements
of parallel algorithms when the processors are of varied
strength. In a typical heterogeneous network of computers,
the load on each processor at any given time is very
different. The dispatcher we designed does not preempt
other processes in the network but utilizes the unused
CPU cycles of the computers on the network. The network
consisted of various processors, including SUN
3/260, SUN 3/110, SUN 3/50, VAX 11/780, NeXT, HP
9000, and AT&T 3B2. Figure 10 shows the speedup
achieved in running 1000 trials using Method_1, in which
we start with the weakest processor and add the stronger
processors. Method_2 starts with a strong processor.
Note that for a given population size, there is a limit on
the number of remote processors we can use. The parallel
algorithm is robust, and considerable speedup is
achieved without affecting the GA search mechanism.

Figure 9: Multiple evaluation functions
results. The COMBO method consistently
performs better than any single evaluation
function.


5.3  Performance  Improvement  with  the  Neural 
Network  Initialization 

The neural network system is employed to inject
heuristic knowledge into the initial population. Figure 11
shows the improvement that the GA search received on a
typical problem when its initial population was seeded
with the recommendations from the neural network system.
The seeded method finds a high-performance parameter
initially, and continues to improve during subsequent
trials.


[Figure 10 legend: Method_1, Method_2]

Figure 10: Performance improvement
with the parallel implementation. The value
plotted is the CPU time required to run 1000
trials with the COMBO method. In Method_1
the weaker processors are added and in
Method_2 the stronger processors are added.

Figure 11: Performance improvement
achieved on a single problem when the initial
population of the genetic search is seeded
with the recommendations from the neural
network system.

6 Conclusion

The results demonstrate considerable improvement in
the GA search by using multiple sharing evaluation
functions. Seeding the initial population with heuristically
chosen  population  structures  using  a  neural  network  sys¬ 
tem  also  achieves  a  significant  speedup.  The  heterogene¬ 
ous  network  of  computers  is  a  growing  trend  in  many 
research  institutions.  It  is  often  the  case  that  only  a  frac¬ 
tion  of  the  total  CPU  cycles  is  used  in  such  a  network. 
XVRP-PGA  utilizes  CPU  cycles  throughout  the  network 
and  uses  them  in  parallelizing  the  evaluation  of  the  popu¬ 
lation  structure  within  the  genetic  search.  Each  of  the 
methods  described  in  this  paper  can  be  used  separately  or 
together. When all three are used, a relatively powerful
genetic search can be conducted for parameter discovery
in an environment of networked desktop workstations.

Abstract

In  this  paper  we  address  the  problem  of 
learning  disjunctive  normal  form  rules  from 
boolean-classified  examples.  Whereas  GA’s 
often present their solution in the form of a
single  individual,  here  we  use  the  entire  pop¬ 
ulation  of  individuals  to  disjunctively  repre¬ 
sent  a  solution.  We  explain  example  sharing, 
a  method  of  payoff  sharing,  and  we  explain 
over-population,  a  modification  for  increasing 
GA  exploration. 

1  Introduction 

Problems  with  disjunctive  solutions  do  not  lend  them¬ 
selves  to  the  traditional  Genetic  Algorithm  (GA).  Typ¬ 
ically,  the  successful  Simple  Genetic  Algorithm  pro¬ 
duces  as  its  result  the  one  individual  representing  the 
best  evaluated  point  in  the  parameter  space.  Since 
individuals usually have constant length and disjunctions
are by nature variable in length, some departure
from the Simple GA is necessary. Disjunction might
be achieved by either “messy” GA’s (using variable-
length individuals) [Goldberg et al., 1989], or by methods
that use multiple individuals.

The  last  approach  was  taken  by  Wilson’s  classifier 
system, called “Boole” [Wilson, 1987]. The “Boole”
classifier  system  learned  disjunctive  boolean  concepts 
by  addressing  the  animat  problem — thus,  “Boole”  op¬ 
erated  under  the  constraints  of  learning  incrementally 
and  of  noisy  and/or  partial  feedback. 

For  many  inductive  learning  tasks,  however,  these 
constraints  do  not  hold,  and  we  can  learn  non- 
incrementally  with  a  database  of  correctly  classified 
examples.  In  this  case,  the  classifier  approach  seems 
to  complicate  the  model  unnecessarily. 

We present a model based on the Simple GA which
ments of our model are 1) a niche-formation method
which  utilizes  a  novel  version  of  payoff  sharing,  and  2) 
an  exploration-increasing  scheme  of  population  man¬ 
agement. 

Our  GA  produces  as  its  result  the  entire  popula¬ 
tion.  We  let  each  individual  represent  a  conjunction 
of variables, and then take the disjunction of the whole
population to produce a single boolean answer. In this
way  the  GA’s  population  represents  a  disjunctive  nor¬ 
mal  form  boolean  formula. 

The individuals are comprised of genes which may
take  on  the  alleles  true,  false  and  don’t-care.  There  is 
a  one-to-one  correspondence  between  the  genes  of  an 
individual  and  the  boolean  variables  of  the  parameter 
space  to  be  searched.  If  the  gene  string  of  an  individ¬ 
ual,  acting  as  a  template,  “matches”  the  boolean  value 
string  of  an  example,  then  that  individual  includes  or 
covers  that  example.  Thus,  each  individual  represents 
the  conjunction  of  its  non-don’t-care  variable  values. 
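
The template matching described above can be sketched in a few lines. This is only an illustration under assumptions: the names `covers`, `classify` and `to_dnf`, and the use of `None` for the don't-care allele, are hypothetical choices and not taken from the original implementation.

```python
from typing import List, Optional, Sequence

# One gene per boolean variable: True, False, or None for don't-care.
Individual = Sequence[Optional[bool]]
Example = Sequence[bool]


def covers(individual: Individual, example: Example) -> bool:
    """True if every non-don't-care gene equals the example's bit."""
    return all(g is None or g == bit for g, bit in zip(individual, example))


def classify(population: List[Individual], example: Example) -> bool:
    """Disjunction over the population: positive if any individual covers."""
    return any(covers(ind, example) for ind in population)


def to_dnf(population: List[Individual]) -> str:
    """Render the population as a DNF formula over variables x0, x1, ..."""
    terms = []
    for ind in population:
        literals = [f"x{k}" if g else f"~x{k}"
                    for k, g in enumerate(ind) if g is not None]
        terms.append(" & ".join(literals) if literals else "TRUE")
    return " | ".join(f"({t})" for t in terms)
```

Each individual is one conjunction, and the population as a whole is read as their disjunction, so duplicated individuals simply repeat a term of the formula.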

The  question  becomes,  how  does  one  get  a  GA  to 
learn  a  population  of  templates  that  will  disjunctively 
cover  the  positive  examples  to  be  learned,  but  avoid 
covering the negative examples? One obvious solution
is  to  use  a  niche-formation  approach  to  prevent  con¬ 
vergence  of  the  population  to  a  single  individual.  We 
accomplish  this  niche-formation  by  example  sharing. 


2  Example  Sharing 

Payoff  sharing  is  a  well-known  method  of  niche- 
formation.  Goldberg  and  Richardson  implemented 
sharing  by  decreasing  an  individual’s  payoff  accord¬ 
ing  to  its  proximity  to  other  individuals  [Goldberg  and 
Richardson, 1987]. This method successfully produced
clusters  of  individuals  on  the  peaks  of  the  evaluation 
function,  thus  performing  niche-formation  [Deb  and 
Goldberg, 1989].

Niche-formation  behavior  is  what  we  need.  We  want 
individuals  to  cover  clusters  of  positive  examples  (the 
peaks)  while  avoiding  covering  the  negative  examples 
(the  valleys).  In  this  way  clusters  of  positive  examples 
in  the  parameter  space  are  desirable  niches. 


2.1  Definition  of  Example  Sharing 

Instead of sharing based on a proximity measure, however,
we find that this problem lends itself to a more
directly applied form of sharing: individuals share
the  value  of  each  example  they  cover  with  other  indi¬ 
viduals  that  also  cover  the  same  example.  Instead  of 
using  a  measure  based  on  the  geometry  of  the  param¬ 
eter  space,  independent  of  the  specific  examples  to  be 
learned,  this  enforces  direct  dependence  on  the  distri¬ 
bution  of  these  examples.  Conceptually  this  sharing 
performs equally well in areas of the parameter space
where  clusters  of  positive  and  negative  examples  are 
small  and  close  together  as  in  areas  where  the  exam¬ 
ple  clusters  are  expansive  and  widely  spaced.  It  is  not 
based  on  a  constant  distance  factor,  but  varies  with 
the  layout  of  the  positive  and  negative  examples.  This 
method  is  much  like  Wilson’s  distribution  of  environ¬ 
mental  feedback  among  the  action  set  of  classifiers. 

We  begin  the  calculation  of  example  sharing  payoff 
as  follows:  The  positive  examples  each  have  a  posi¬ 
tive  value,  and  the  negative  examples  each  a  negative 
value, i.e. 1 and -1. Determine the number of individuals
that cover each example, and divide the example’s
value by this number. The result is the payoff from
this  example  (e)  that  is  given  to  each  individual  (i) 
covering  it.  One  can  intuitively  think  of  each  exam¬ 
ple  as  having  only  a  finite  amount  of  payoff  to  give, 
and  this  finite  payoff  being  shared  equally  among  its 
recipients. 


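$$
\mathrm{payoff}_{i,e} =
\begin{cases}
0, & \text{if } i \text{ doesn't cover } e \\
\dfrac{\mathrm{value}_e}{\#\ \text{individuals covering } e}, & \text{if } i \text{ covers } e
\end{cases}
$$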


Then  the  fitness  of  an  individual  is 


$$\mathrm{fitness}_i = \sum_{\text{all } e} \mathrm{payoff}_{i,e}$$


2.2  Adjusted  Payoff 

There  is,  of  course,  one  immediate  problem  with  the 
above  calculation.  It  is  possible  for  an  individual  which 
covers  negative  examples  to  receive  a  negative  pay¬ 
off  value,  confounding  completely  the  “roulette  wheel” 
method of selection. The probability that a certain individual
will be chosen at a single spin of the wheel is


$$p(i \text{ reproduces}) = \frac{\mathrm{fitness}_i}{\sum_{\text{all } j} \mathrm{fitness}_j}$$


Clearly,  an  individual  with  a  negative  fitness  produces 
a  negative  probability — an  impossibility.  In  order  to 
properly  decide  the  probability  of  an  individual’s  selec¬ 
tion  according  to  its  evaluation,  all  evaluation  values 
must  be  positive. 

Giving  negative  examples  a  value  of  zero  is  not 
a  viable  solution,  since  this  would  give  no  penalty 
to individuals which cover them. Instead, we pass
the  above  calculated  fitness  through  a  function  which 
makes  negative  numbers  positive  while  roughly  keep¬ 
ing  the  proper  relation  between  the  individuals’  fitness 
and  their  probabilities  of  selection. 

$$
\text{adjusted fitness}_i =
\begin{cases}
\mathrm{fitness}_i + 1, & \text{if } \mathrm{fitness}_i \ge 0 \\
\ldots, & \text{if } \mathrm{fitness}_i < 0
\end{cases}
$$


2.3  Specificity  Factor 

There  is  still  a  problem  with  our  approach  thus  far. 
The  above  calculations  count  equally  both  positive  and 
negative  examples.  How  many  positive  examples  do  we 
want  an  individual  to  gain  in  order  to  justify  covering 
an  additional  negative  example?  The  implementation 
of  the  above  model  could  conceivably  rank  equally  an 
individual that covered three positive examples and
no negative examples with an individual that covered
four  positive  examples  and  one  negative  example.  Is 
this  what  we  want?  Well,  that  depends  on  how  much 
specificity  versus  sensitivity  and  simplicity  we  want.  If 
we  penalize  an  individual  heavily  for  covering  a  neg¬ 
ative  example,  the  GA  will  tend  to  produce  highly 
specific  individuals.  This  may  be  desirable,  but  what 
we  gain  in  specificity  we  may  lose  in  sensitivity  (the 
population will tend not to be able to cover all positive
examples),  and  lose  in  simplicity  (the  population  will 
be  more  diverse,  have  less  duplication  of  individuals, 
and  therefore  translate  into  longer  disjunctive  normal 
form  formulas).  Thus,  we  introduce  a  user-settable 
specificity  factor  by  which  we  multiply  the  values  of 
the  negative  examples.  A  specificity  factor  of  one  will 
make  no  change  from  the  above  calculations;  increas¬ 
ing  the  factor  will  tend  to  create  more  and  more  spe¬ 
cific  individuals. 
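
The raw example-sharing fitness, including the specificity factor, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `None` encoding of don't-care, and the `(bits, is_positive)` example format are assumptions. The adjustment of section 2.2 is deliberately left out, so the returned values may be negative.

```python
from typing import List, Optional, Sequence, Tuple

Individual = Sequence[Optional[bool]]            # None stands for don't-care
LabeledExample = Tuple[Sequence[bool], bool]     # (bits, is_positive)


def covers(individual: Individual, bits: Sequence[bool]) -> bool:
    """True if every non-don't-care gene equals the example's bit."""
    return all(g is None or g == b for g, b in zip(individual, bits))


def shared_fitness(population: List[Individual],
                   examples: List[LabeledExample],
                   specificity: float = 1.0) -> List[float]:
    """Raw (unadjusted) example-sharing fitness for every individual."""
    fitness = [0.0] * len(population)
    for bits, is_positive in examples:
        coverers = [i for i, ind in enumerate(population) if covers(ind, bits)]
        if not coverers:
            continue  # nobody covers this example, so nobody is paid
        # Positive examples are worth +1; negative ones -1 times the factor.
        value = 1.0 if is_positive else -1.0 * specificity
        share = value / len(coverers)  # the example's value is split equally
        for i in coverers:
            fitness[i] += share
    return fitness
```

With three individuals `(T, *)`, `(T, T)`, `(*, F)` and the examples `TT` (positive), `TF` (positive), `FF` (negative), a specificity factor of 1 gives fitnesses 1.0, 0.5 and -0.5; raising the factor to 3 only changes the last value, to -2.5, because only the third individual covers the negative example.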

3  Increasing  GA  Exploration 

In  addition  to  modifying  the  sharing  mechanism 
used to develop niche-formation, we have made some
changes to improve the exploration of the GA so that
fewer positive niches will be left uncovered. Increasing
the probability of mutation and the probability of
crossover  have  the  obvious  effect  of  increasing  the  ex¬ 
ploration  of  a  population,  but  unfortunately  they  also 
tend  to  introduce  lethals  and  remove  desirably  fit  in¬ 
dividuals.  We  have  developed  a  modification  we  call 
over-population  which  increases  exploration,  yet  alle¬ 
viates  this  problem. 

3.1  Over-population 

Over-population  is  a  population  management  strategy 
somewhat  like  population  overlap  [DeJong,  1975].  In 
population  overlap  a  certain  percentage  of  the  indi¬ 
viduals  in  a  population  are  chosen  to  skip  crossover 
and mutation, and pass unchanged to the next generation.
In over-population the entire population is
passed  on  unchanged.  How  does  any  innovation  take 
place,  you  ask?  Each  individual  of  the  population  does 
go  through  crossover  and  mutation  to  produce  new  in¬ 
dividuals  for  the  next  generation — but  the  parents  and 
unmutated  individuals  also  remain  in  the  population. 
In  this  way,  each  operation  doubles  the  population  size. 

For  example,  an  initial  population  of  20,  after  mat¬ 
ing,  will  contain  both  the  parents  and  the  children,  for 
a  total  of  40  individuals.  Mutation  of  this  set  of  in¬ 
dividuals  will  produce  both  the  mutated  parents  and 
children,  and  the  unmutated  parents  and  children,  for 
a  total  of  80  individuals.  Payoff  sharing  now  occurs 
among  all  80  individuals.  This  quadrupled  population 
is  brought  back  to  its  original  size  of  20  in  the  selec¬ 
tion  step,  which  only  spins  the  fitness  roulette  wheel 
20  times. 

This procedure can be thought of as a global best-preserving
algorithm. It allows us to make all individuals
mate (p(cross) = 1.0, no mating restriction),
and all individuals mutate (p(gene change) =
1/length_individual), in order to increase GA exploration
to the maximum, while it also retains the best of both
the pre- and post-exploration populations for consideration
in selection. Intuitively one can imagine a situation
where the birthrate far exceeds the sustaining
support  of  the  environment,  and  individuals  are  dying 
off  not  from  old  age,  but  from  competition  with  each 
other — thus,  we  say  over-population. 
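
One generation of over-population might look like the sketch below. It is written under several assumptions that the text does not fix: parents are paired by shuffling, the population size is even, a mutated gene always changes to a different allele, and `fitness_fn` is a hypothetical callback returning one strictly positive weight per individual (for instance an adjusted example-sharing fitness).

```python
import random

ALLELES = (True, False, None)  # None stands for don't-care


def uniform_crossover(a, b, rng):
    """Swap each gene position between the two parents with probability 0.5."""
    c1, c2 = list(a), list(b)
    for k in range(len(a)):
        if rng.random() < 0.5:
            c1[k], c2[k] = c2[k], c1[k]
    return tuple(c1), tuple(c2)


def mutate(ind, rng):
    """Change each gene with probability 1/length to a different allele."""
    p = 1.0 / len(ind)
    return tuple(
        rng.choice([a for a in ALLELES if a != g]) if rng.random() < p else g
        for g in ind
    )


def over_population_step(population, fitness_fn, rng):
    """One generation: N parents, 2N after mating, 4N after mutation, N kept."""
    n = len(population)
    parents = list(population)
    rng.shuffle(parents)  # random pairing (assumption)
    children = []
    for a, b in zip(parents[0::2], parents[1::2]):
        children.extend(uniform_crossover(a, b, rng))
    mated = list(population) + children                      # parents + children
    expanded = mated + [mutate(ind, rng) for ind in mated]   # plus mutated copies
    weights = fitness_fn(expanded)                           # sharing over all 4N
    survivors = rng.choices(expanded, weights=weights, k=n)  # n wheel spins
    return expanded, survivors
```

Starting from 20 individuals, `expanded` holds 80 and `survivors` holds 20, matching the counts in the example above; the unmodified parents are always among the candidates for selection.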

3.2  Over-population  Performance 

This  method  quadruples  the  number  of  individuals  to 
be  evaluated,  but  the  increased  exploration  makes  the 
GA  converge  to  a  solution  more  quickly.  We  performed 
experiments using the AP* 10 data set of bacterial
classification [Robertson and MacLowry, 1975]. The
bacteria are described by ten binary attributes and
are classified as either E. coli or not E. coli. There are
37,555  examples  of  276  different  10-bit  patterns,  the 
more  common  bacteria  being  replicated  proportion¬ 
ally  more  times.  Using  a  population  size  of  30  and 
a  specificity  factor  of  3,  we  ran  the  example-sharing 
GA  both  with  and  without  over-population  ten  times 
each.  The  median  number  of  generations  required  to 
learn  the  bacterial  classification  to  98%  accuracy  with 
over-population  was  5  (range  1  to  26)  versus  46  (range 
4  to  138)  without  over-population. 

We  believe  that  enhanced  exploration  resulting  from 
over-population  is  especially  helpful  in  situations  such 
as  this,  where  not  all  of  the  points  in  the  parameter 
space  have  a  specified  value. 

In  addition  to  faster  convergence,  we  expect  over¬ 
population  to  allow  the  use  of  smaller  populations  than 
are  otherwise  necessary  to  keep  genetic  diversity.  We 
have  not  yet  shown  this  experimentally. 

4 Comparison with “Boole”: Experimental Results

In  this  section,  we  present  the  performance  of  our  GA 
on  examples  from  a  disjunctive  boolean  function,  and 
compare  the  results  with  “Boole.” 

4.1  The  Example  Set 

The  examples  to  be  learned  conform  to  the  “6- 
multiplexer”  function  also  used  by  Wilson  [Wilson, 
1986].  The  function  can  be  described  as  follows.  Of 
the  six  bits  of  the  parameter  space,  the  first  two  bits 
can  be  thought  of  as  address  bits,  while  the  last  four 
make up the data bits. The value of an example is
determined  by  the  value  of  the  data  bit  found  at  the 
location  specified  by  the  address  bits.  For  instance, 
the example (0 0 1 0 0 0) would be classified as a positive
example since the first two bits specify address
zero and the zeroth data bit is a one. The example
(1 1 0 1 1 0) would be classified as a negative example
because the third data bit is a zero.

For  instance,  one  possible  set  of  disjuncts  that  covers 
all  the  positive  examples  and  no  negative  examples  of 
the  six-multiplexer  is: 

0  0  1  *  *  * 

0  1  *  1  *  * 

1  0  *  *  1  * 

1  1  *  *  *  1
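
The claim that these four disjuncts cover exactly the positive examples can be checked exhaustively over all 64 inputs. The sketch below assumes that the first address bit is the high-order one, which is the reading consistent with the disjuncts listed above; the helper names are illustrative.

```python
from itertools import product


def multiplexer6(bits):
    """Value of the data bit selected by the two address bits."""
    address = 2 * bits[0] + bits[1]   # first bit taken as high-order (assumption)
    return bits[2 + address] == 1


DISJUNCTS = ["001***", "01*1**", "10**1*", "11***1"]


def matches(template, bits):
    """A template matches if every non-'*' position equals the example bit."""
    return all(t == "*" or int(t) == b for t, b in zip(template, bits))


def dnf(bits):
    """Disjunction of the four templates."""
    return any(matches(t, bits) for t in DISJUNCTS)


# Inputs on which the DNF and the multiplexer disagree (expected to be none).
mismatches = [bits for bits in product((0, 1), repeat=6)
              if dnf(bits) != multiplexer6(bits)]
```

An empty `mismatches` list means the disjunction agrees with the 6-multiplexer on every input, positive and negative alike.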


4.2  Implementation  Details 

The  example-sharing  GA  with  over-population  was 
implemented  in  Allegro  Common  Lisp  on  a  NeXT 
cube.  The  following  simulation  results  were  obtained 
in  an  average  of  60  seconds  of  CPU  time. 

4.3  Simulation  Results 

We  set  the  population  size  to  50  and  the  specificity  fac¬ 
tor  to  3.  Note  that  example  sharing  inherently  spec¬ 
ifies  a  p(cross)  of  1.0  and  p(gene  change)  of  1/indi¬ 
vidual  length.  Uniform  crossover  was  used  for  mat¬ 
ing  [Syswerda,  1989].  Based  on  ten  runs,  the  example 
sharing GA achieved perfect performance (100% accuracy)
on the 6-multiplexer example set after a median of 5
generations,  or  1000  trials.  This  compares  very  favor¬ 
ably  with  Wilson’s  12,000  trials  and  97.3%  accuracy 
[Wilson,  1986]. 

5  Future  Work 

One question, as yet unresolved, is whether this approach
to learning disjunctive concepts has any advantage
over known methods such as AQ [Michalski et
al., 1986] and C4 (ID3) [Quinlan, 1986].

We  did  some  preliminary  comparisons  with  C4  using 
the  same  bacterial  classification  data  previously  men¬ 
tioned  in  section  3.2.  The  example-sharing  GA  with 
over-population converged after a median of 5 generations
to a simple 2-term solution that has 99% accuracy.
C4, after pruning, produces a tree with 10 nodes
and 11 leaves with 99.5% accuracy. Thus our GA produces
a simpler representation with slightly less accuracy.
Changing the specificity factor would very likely
alter the accuracy and simplicity.

Efficient  performance  is  another  concern.  Scale- 
up  may  be  better  because  of  the  implicit  parallelism 
of  GA’s.  However,  Quinlan’s  assessment  of  “Boole” 
indicated  that  C4  and  “Boole”  scaled  approximately 
equally  [Quinlan  and  Compton,  1987]. 

Furthermore,  the  global  nature  of  our  example  shar¬ 
ing  (considering  the  entire  population  and  all  the  ex¬ 
amples  together)  may  permit  this  approach  to  find 
more desirable DNF rules than would be found using
the more local search involved in decision-tree algorithms
or AQ-like approaches.

6  Conclusions 

This paper presents two modifications to the Simple
GA that accomplish the learning of disjunctive
boolean  concepts.  Example  sharing  has  been  shown 
to  successfully  form  niches  over  clusters  of  positive 
examples.  We  have  introduced  over-population  as  a 
method  of  increasing  exploration  and  have  shown  that 
it  can  improve  the  successful  convergence  rate  by  a 
factor  of  nine.  Our  modified  GA  successfully  learned 
the  6-multiplexer  boolean  function  in  one-twelfth  the 
number  of  trials  as  “Boole.”  Additional  work  will  be 
needed  to  determine  its  value  as  a  general-purpose  in¬ 
ductive  learning  method. 
Abstract^

Genetics  based  machine  learning  systems  are 
considered  by  a  majority  of  machine  learners  as 
slow  rate  learning  systems.  In  this  paper,  we 
propose  an  improvement  of  Wilson's  classifier 
system  BOOLE  that  shows  how  Genetics 
based  machine  learning  systems  learning  rates 
can  be  greatly  improved.  This  modification 
consists  in  a  change  of  the  reinforcement 
component.  We  then  compare  the  respective 
performances  of  this  modified  BOOLE,  called 
NEWBOOLE,  and  a  neural  net  using  back 
propagation  on  a  difficult  boolean  learning 
task,  the  multiplexer  function.  The  results  of 
this  comparison  show  that  NEWBOOLE 
obtains  significantly  faster  learning  rates. 


1  Introduction 

In recent years, Genetics Based Machine Learning
(GBML)  has  received  increasing  attention  from  the  ML 
community  due  to  the  emergence  of  Classifier  Systems. 
However,  despite  the  demonstrative  results  obtained  by 
various  researchers  with  Classifier  Systems  (CS) 
[Goldberg,  89],  the  slow  learning  rates  that  are  usually 
observed  have  considerably  affected  their  credibility.  The 
results  reported  in  this  paper,  using  an  improvement  of 
Wilson's  BOOLE  system,  tend  to  show  that  convergence 
speed of GBML systems can be greatly improved.


2.  Slow  learning  rates  in  GBML 
systems. 

A  number  of  realizations  in  the  domain  of  CS 
have  shown  their  undisputed  ability  to  learn.  Classifier 
Systems  [Holland,  86]  form  a  family  of  inductive 
learning  systems  which  acquire  rules  incrementally. 
However,  these  realizations  all  share  the  common 
drawback  of  slow  rate  learning  when  compared  to  other 
widespread  learning  algorithms  such  as  decision  tree 
classification  (such  as  ID3)  or  neural  net  back 
propagation. 

Wilson's  classifier  system  BOOLE  [Wilson,  87] 
is  an  example  of  such  a  realization.  BOOLE  is  an 
incremental  learning  system  that  learns  intricate 
boolean  functions  such  as  logic  multiplexers.  However, 
more  recently,  Quinlan  [Quinlan,  88]  compared  the 
respective  performances  of  an  improved  version  of  the 
ID3  algorithm  (C4)  and  BOOLE  on  the  multiplexer 
problem,  and  evidenced  a  much  faster  convergence  rate 
with  C4.  We  show  in  this  paper  that  this  drawback  can 
be  greatly  weakened  by  modifying  the  reinforcement 
component  of  the  original  algorithm.  Furthermore,  C4 
is  non  incremental,  and  as  Booker  mentions  in  [Booker, 
89],  having  access  to  all  the  examples  at  once  is  a 
definite  advantage.  Therefore,  in  our  experiments,  we 
decided  to  compare  the  improved  version  of  BOOLE 
(NEWBOOLE)  with  a  widely  used  incremental  learning 
system:  a  neural  net  using  back-propagation. 


^  This  research  was  partially  supported  by  MRT 
through PRC IA.
3. The BOOLE Classifier System

We  now  present  BOOLE  with  some  detail  concerning 
the  parts  that  have  been  changed,  in  order  to  explain  the 
NEWBOOLE  system  in  the  following  section. 

BOOLE  is  a  simplified  version  of  the  standard 
Classifier  System  (CS)  which  was  designed  by  Wilson 
to  test  the  ability  of  a  GBML  system  to  learn  difficult 
boolean  functions. 

Like  any  CS,  Boole  maintains  a  population  of 
classifiers  (which  can  be  thought  of  as  bit-level  zero 
order  rules)  according  to  Darwinian  evolution  principles. 
However,  classifiers  are  not  chained;  they  directly 
provide  an  output  and  the  decision  is  made  within  a 
single  step  during  recognition;  consequently  there  is  no 
message  list  nor  Bucket  Brigade  Algorithm.  Thus  each 
classifier  consists  of  a  condition  (taxon)  and  an  action 
which are fixed length strings over the {0,1,#} alphabet.

Like  other  CS,  BOOLE  has  the  following 
components: 

1/  Performance  component:  in  the  performance 
cycle,  an  input  string  is  presented  to  the  system,  the 
match  set  M  of  all  classifiers  whose  taxa  match  the 
input  string  is  formed,  and  a  single  classifier  from  M  is 
selected  (using  a  probability  that  is  proportional  to  its 
strength)  whose  action  is  output  as  the  system's 
decision. 

2/  Reinforcement  component:  this  component 
modifies  the  strengths  of  classifiers  according  to 
performance  level: 

a/  Form  the  action  set  [A]  consisting  of  classifiers 
from  [M]  whose  action  is  the  same  as  the  chosen 
action;  the  remaining  members  of  [M]  form  the  set 
Not[A];

b/  Deduct  a  fraction  e  from  the  strengths  of  all 
classifiers  in  [A]; 

c/  *  If  the  system's  decision  was  correct,  distribute 
a  payoff  quantity  R  to  the  strengths  of  [A];  but 

*  If  the  decision  was  wrong,  distribute  a  payoff 
quantity  R'  (where  0  <  R'  <  R)  to  the  strengths  of  [A] 
and  deduct  a  fraction  p  from  the  strengths  of  [A]  (at  least 
one  of  R’  and  p  is  equal  to  0); 

d/  Deduct  a  fraction  t  from  the  strengths  of  Not  [A]. 

The distribution of payoff is done so that rules which
have  many  # 's  (thus  more  general)  are  favored. 

3/  Discovery  component,  which  modifies  the 
classifier  population  according  to  Holland's  genetic 
algorithm [Holland, 75] and employs reproduction,
genetic operations (crossover and mutation), and
deletion.
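
The performance component and the split of the match set into [A] and Not[A] can be sketched as below. The `Classifier` dataclass and the function names are illustrative assumptions rather than BOOLE's actual data structures, and the sketch assumes the match set is never empty.

```python
import random
from dataclasses import dataclass


@dataclass
class Classifier:
    taxon: str       # condition over {'0', '1', '#'}
    action: int      # 0 or 1
    strength: float


def matches(taxon, inputs):
    """'#' matches anything; other positions must equal the input bit."""
    return all(t == "#" or t == x for t, x in zip(taxon, inputs))


def performance_cycle(population, inputs, rng):
    """Form [M], then select one classifier with probability ~ strength."""
    match_set = [c for c in population if matches(c.taxon, inputs)]
    chosen = rng.choices(match_set,
                         weights=[c.strength for c in match_set], k=1)[0]
    # [A]: members of [M] advocating the chosen action; Not[A]: the rest.
    action_set = [c for c in match_set if c.action == chosen.action]
    not_action_set = [c for c in match_set if c.action != chosen.action]
    return chosen.action, match_set, action_set, not_action_set
```

The reinforcement steps a/ to d/ then operate on the two returned sets, and the discovery component would be invoked separately.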

BOOLE'S  version  of  the  genetic  algorithm  is 
quite  particular  in  the  sense  that  only  one  offspring  is 
added  per  invocation  of  the  genetic  algorithm.  In  this 
context, the parameter ρ will represent the average
number  of  invocations  of  the  genetic  algorithm  per 
cycle  (i.e.  the  number  of  offspring  added  per  cycle).  For 
the  detailed  algorithm,  please  see  [Wilson,  1987]. 


Wilson  experimented  in  [Wilson,  87]  with  this 
system  using  a  highly  disjunctive  function,  the 
multiplexer function, also used by Barto [Barto, 1985].
In the case of the "6-multiplexer", for each six-bit input
string (a0, a1, x0, x1, x2, x3), the boolean expression
of the output is:

$$F_6 = \neg a_0 \cdot \neg a_1 \cdot x_0 + a_0 \cdot \neg a_1 \cdot x_1 + \neg a_0 \cdot a_1 \cdot x_2 + a_0 \cdot a_1 \cdot x_3 \qquad (1)$$

Figure  2  shows  an  experiment  in  which  BOOLE 
learned  to  respond  correctly  to  this  problem.  The 
parameters  used  for  this  experiment  are  given  in  section 
4. 

At  each  cycle,  an  example  is  chosen  at  random  and  is 
presented  to  the  system.  The  graph  plots  the  system's 
average  score  which  is  the  percentage  of  correct 
decisions  over  the  past  50  cycles  versus  the  number  of 
cycles  since  the  experiment  began. 
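
The plotted score is a sliding-window average. A minimal sketch follows; the function name is illustrative, and averaging over fewer than 50 decisions during the first cycles is an assumption about how the start of the curve is handled.

```python
from collections import deque


def running_score(outcomes, window=50):
    """Percentage of correct decisions over the last `window` cycles."""
    recent = deque(maxlen=window)   # old outcomes fall off automatically
    scores = []
    for correct in outcomes:
        recent.append(1 if correct else 0)
        scores.append(100.0 * sum(recent) / len(recent))
    return scores
```

For instance, with a window of 2 the outcome sequence correct, wrong, correct, correct yields the scores 100, 50, 50, 100.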

The  results  obtained  by  BOOLE  show  that  a 
"rather  difficult  disjunctive  incremental  learning  task" 
can  be  solved  by  GBML.  However,  the  learning  rate  is 
extremely  slow,  as  was  pointed  out  by  Quinlan  in 
[Quinlan,  88],  where  he  compares  the  respective 
performances  of  BOOLE  and  C4  on  the  same 
multiplexer  task. 

4.  The  NEWBOOLE  CS 

NEWBOOLE  is  a  CS  derived  from  Boole  which 
obtains  much  faster  learning  rates.  We  examine  in  this 
section  the  improved  learning  algorithm. 

4.1 A new payoff strategy: "Symmetrical payoff-penalty"

BOOLE's reinforcement component, under the
"payoff-penalty" reinforcement regime (p ≠ 0), adjusts
classifier strengths in the following way:

-  if  the  system's  decision  is  correct,  distribute  a 
quantity  R  to  the  strengths  of  the  Actionset  [A]. 
- if the system's decision is false, penalize the
strengths  of  [A]  by  deducting  a  fraction  p  from  their 
values. 

-  finally,  whether  the  system's  decision  is  correct 
or  not,  deduct  fractions  e  and  t  respectively  from  the 
strengths of [A] and Not[A].

Thus,  following  each  performance  cycle,  only  the 
strengths of [A] are adjusted according to the system's
performance. 

However,  once  we  know  that  [A]  contains 
accurate  classifiers,  we  also  know  that  Not[A]  only 
contains  inaccurate  classifiers;  in  this  case  it  would 
make  sense  to  penalize  the  rules  in  Not[A].  This 
acknowledgement  led  us  to  a  "symmetrical  payoff- 
penalty"  algorithm,  in  which  we  respectively  reward  and 
penalize  the  accurate  and  inaccurate  classifiers  present  in 
the match set.

The  new  reinforcement  component  is  the 
following: 

1/  Form  the  subset  of  [M]  consisting  of  those 
classifiers  whose  action  is  accurate;  this  is  the  correct 
set [C]. The remaining members of [M] form the set
Not[C].

2/ Deduct a fraction e from the strengths of [C].

3/  Since  [C]  contains  the  accurate  classifiers, 
distribute  a  payoff  quantity  R  to  the  strengths  of  [C]. 

4/  Since  Not[C]  contains  the  inaccurate 
classifiers, deduct a fraction p from the strengths of
Not[C]. 

Thus,  the  effect  of  the  reinforcement  component 
can  be  written  as: 


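$$S_{[C]}(t+1) = (1 - e)\, S_{[C]}(t) + R \qquad (3)$$

$$S_{\mathrm{Not}[C]}(t+1) = (1 - p)\, S_{\mathrm{Not}[C]}(t) \qquad (4)$$

where $S_{[C]}$ and $S_{\mathrm{Not}[C]}$ are respectively [C]'s and
Not[C]'s total strengths.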

This  new  algorithm  constitutes  a  clear  departure 
from  Boole:  indeed,  if  we  have  several  possible  output 
values  then  the  knowledge  of  the  correctness  of  the 
output  of  each  classifier  from  the  match  set  is  used. 
This information can be provided by the knowledge of
the  correct  output  for  each  example,  as  is  done  in  most 
learning  systems.  However,  this  does  not  make  any 
difference  with  boolean  functions  such  as  the 
Multiplexer  since  only  two  values  are  possible;  if  one 
is  known  as  wrong,  then  the  other  one  is  right. 
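
The symmetrical payoff-penalty steps can be sketched as a single in-place update. This is an illustration under assumptions: classifiers are plain dicts with hypothetical keys `action` and `strength`, [C] is assumed non-empty, and R is split equally among the members of [C], which is a simplification standing in for the biased distribution D.

```python
def symmetrical_update(match_set, correct_action, e, p, R):
    """Apply steps 1/ to 4/ in place and return ([C], Not[C])."""
    # Step 1: split the match set by correctness of each classifier's action.
    correct = [c for c in match_set if c["action"] == correct_action]
    incorrect = [c for c in match_set if c["action"] != correct_action]
    # Step 2: deduct a fraction e from the strengths of [C].
    for c in correct:
        c["strength"] *= (1.0 - e)
    # Step 3: distribute R to [C] (equal split here: a simplification).
    for c in correct:
        c["strength"] += R / len(correct)
    # Step 4: deduct a fraction p from the strengths of Not[C].
    for c in incorrect:
        c["strength"] *= (1.0 - p)
    return correct, incorrect
```

Whatever rule is used to split R, the total strength of [C] becomes (1 - e) times its old total plus R, and the total strength of Not[C] is scaled by (1 - p), independently of which classifier was actually selected to produce the output.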


As  in  Boole,  the  payoff  R  to  [C]  is  distributed 
by  a  biased  distribution  function  D,  which  favors  more 
general  rules  (i.e.  with  many  "don't  cares"  #)  as  follows. 

First,  the  generality  of  each  classifier  i  of  length  L  is 
computed  as: 

$$g_i = \frac{\text{number of \#'s in } i}{L} \qquad (5)$$

Let  us  define: 

$$d_i = 1 + G \times g_i \qquad (6)$$

where  G  is  a  "generality  emphasis"  parameter. 

Then,  the  portion  of  reward  Ri  that  is  given  to 
classifier  i  becomes: 

$$R_i = D(i) \times R = \frac{d_i}{\sum_{j \in [C]} d_j}\, R \qquad (7)$$
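
The generality-biased split of the reward can be sketched directly from the three definitions above. The function names are illustrative, and `taxa` is assumed to hold the condition strings of the classifiers in [C].

```python
def generality(taxon):
    """Fraction of don't-care symbols in the condition string."""
    return taxon.count("#") / len(taxon)


def reward_shares(taxa, G, R):
    """Split R among classifiers in proportion to d_i = 1 + G * g_i."""
    d = [1.0 + G * generality(t) for t in taxa]
    total = sum(d)
    return [R * di / total for di in d]
```

With G = 4 and R = 1000, a fully general taxon `######` has d = 5 and a fully specific one `000000` has d = 1, so they would receive five sixths and one sixth of the reward respectively.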

4.2  Experiments  with  NEWBOOLE 

We  experimented  with  NEWBOOLE  using  the 
multiplexer  problem,  our  main  concerns  being  on  the 
one  hand  to  compare  BOOLE'S  and  NEWBOOLE's 
respective performances, and on the other, to compare
NEWBOOLE  and  a  neural  net  using  Back  Propagation 
(BP).  We  describe  two  sets  of  experiments,  one  with  the 
6-multiplexer, the other with the 11-multiplexer.

Each  experiment  was  conducted  by  making  4 
independent  runs  with  different  random  initializations 
and  averaging  the  values  over  these  runs. 

The  table  below  gives  a  complete  description  of  the 
genetic  experimental  parameters  used. 


4.2.1 NEWBOOLE and the 6-multiplexer problem.

1/ In the first experiment (Figures 1, 2), we
tested NEWBOOLE's performance using exactly the
same parameter values as in BOOLE in order to evaluate
the effect of the change in the reinforcement component.
As can be seen, the results are quite demonstrative:
without any "parameter tuning", the system's score
reaches 97.3 % after only 2000 trials. The learning
rate only takes into account the system's performance.

The  lower  plot  in  Figure  1  shows  a  quantity  called 
the  relative  solution  count,  equal  to  the  relative  number 
of  instances  of  the  solution  set  [S6],  which  represents 
the  minimal  set  of  classifiers  capable  of  solving  the 
problem  perfectly. 
Parameter                   Boole        NewBoole     NewBoole   NewBoole
                            6Mux         6Mux         6Mux       11Mux
                            (Fig. 1, 2)  (Fig. 1, 2)  (Fig. 3)   (Fig. 4)

e                           0.1          0.1          0.1        0.1
crossover rate              0.125        0.125        0.5        0.5
mutation rate               0.001        0.001        0.001      0.001
G (generality enforcement)  4            4            4          4
R (reward)                  1000         1000         1000       1000
p (penalty coef.)           0.8          0.78 ^       0.95       0.95
ρ (renewing)                1            1            4          4
t                           0.1          ////         ////       ////
population                  400          400          400        1000
deterministic output        NO           NO           YES        YES

Table 1: Experimental parameters for Boole and NewBoole.


6-multiplexer  with  untuned  NewBoole 


^ p is taken as 0.78 and not 0.80 in order to ensure total equivalence between the experiments with Boole and
untuned NewBoole, since the parameter t is no longer used in the NewBoole algorithm.
Hence, this ratio is a measure of the validity of the
population:  the  predominance  of  [S6]  in  the  evolved 
population  would  show  that  the  system  is  capable  of 
finding  the  best  among  the  accurate  classifiers. 


It is interesting to notice that even though the system
attains quasi-perfect response (99.8%) after 3200 cycles,
the validity (solution count) continues to grow at a
rate similar to that in BOOLE.


NewBoole versus Boole

Figure 2: Comparison between untuned NewBoole and Boole.


Cycles   Score (%)   Score (%)          Error (%)   Error (%)
         Boole       untuned NewBoole   Boole       NewBoole

0        48          50                 52          50
500      75.9        86.7               24.1        13.3
1000     84.6        92.9               15.4        7.1
3200     91.2        99.8               8.8         0.2
5200     93.9        100                6.1         0
12000    97.3        100                2.7         0

Table 2: Comparison between Boole's and NewBoole's
convergence rates


2/  We  present  in  the  second  experiment  (Figure 
3)  a  "tuned"  version  of  the  NEWBOOLE  algorithm. 

*  Firstly,  we  modified  the  values  of  certain 
parameters in order to speed up the learning process (see
Table  1). 


*  Secondly,  we  modified  the  Performance 
Component  in  the  following  way:  instead  of  selecting 
the  "decision"  classifier  probabilistically,  we 
systematically  picked  the  highest  ranked  classifier  in  the 
match  set:  this  deterministic  selection  affects  in  no  way 
the learning process, since the sets [C] and Not[C]
are not determined as a function of the selected
classifier.  This  modification,  as  noted  in  [Booker,  89], 
permits  a  more  steady  convergence  level;  furthermore, 
since  we  are  comparing  NEWBOOLE  with  a 
deterministic  algorithm  (Back  Propagation),  it  seemed 
logical  to  include  some  "determinism"  in  the  algorithm. 

We obtained an impressive learning rate after
only 800 trials by modifying the value of the fraction
deducted from the set of inaccurate classifiers (p = 0.95),
the frequency of invocation of the genetic algorithm per
cycle (ρ = 4), and the crossover rate (0.5). This
constitutes approximately a 17 fold improvement over
Boole's original convergence rate.
3/ Also in Figure 3, we present an experiment using
a  Neural  Network  with  a  Back-Propagation  learning 
algorithm.  The  architecture  (6:20-20-10-10:1)^  and  the 
parameters  were  tuned  for  this  problem.  Indeed,  we  see 
that  convergence  is  reached  after  1600  trials. 

Of  course,  we  noticed  that  a  more  complex  network 
(6:100-50-40-30:1)  converges  within  900  trials. 
However,  the  performance  is  not  better  than  with 
NEWBOOLE, and the memory occupied (6 × 100 +
100 × 50 + 50 × 40 + 40 × 30 + 30 = 8830
connections = 8830 × 4 = 35320 bytes) is far
beyond what our population occupies (400 classifiers of
7 units each with 2 bits per unit = 800 bytes).
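
The arithmetic behind this memory comparison can be reproduced with a short sketch. The helper name is illustrative; the 4 bytes per connection and the 2 bits per classifier unit are the figures quoted in the text, and bias terms and classifier strengths are not counted.

```python
def connection_count(layers):
    """Number of weights between consecutive layers (no bias terms)."""
    return sum(a * b for a, b in zip(layers, layers[1:]))


big_net = [6, 100, 50, 40, 30, 1]          # the (6:100-50-40-30:1) network
connections = connection_count(big_net)
net_bytes = connections * 4                # 4 bytes per connection

classifier_bits = 400 * 7 * 2              # 400 classifiers, 7 units, 2 bits
classifier_bytes = classifier_bits / 8     # if the bits are tightly packed
```

Under these assumptions the network comes to 8830 connections and 35320 bytes, while the tightly packed population comes to 5600 bits, that is 700 bytes. This is of the same order as the 800 bytes quoted, which would result if each 14-bit classifier were stored in two whole bytes; that padding is only a guess.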

Therefore,  when  comparing  NEWBOOLE  with  a 
Neural  Net  (NN)  using  BP,  we  restricted  ourselves  to 
reasonable  networks. 


¹This notation means that the multilayered network has 6 inputs, then two layers of 20 cells, then two layers of 10 cells, and one output layer of one cell.

In all the networks, the parameters of each cell depend on the number of cell inputs N_in: learning rate ε = 0.1/sqrt(N_in), decay δ = 0, noise θ = 0, momentum α = 0; weights W_ij(0) are initialized randomly over the interval [-1.5/N_in; 1.5/N_in].


Please also note that the simplest NN that can solve our problem (6:4:1) needs 7500 cycles to converge; this compares with the number of cycles NEWBOOLE needs to find the minimal set (around 5000 cycles for 80% of the minimal rules; the other rules have a very low strength and can easily be removed). However, NEWBOOLE finds this set automatically, whereas the NN architecture had to be specified in advance.


4.2  NEWBOOLE and the 11-multiplexer problem


Figure 4 shows the results obtained using NEWBOOLE to solve the 11-multiplexer compared with those obtained using the following neural net: (11:40-40-20-20:1).

The population was increased by a factor of 2.5 in order to fit the considerably larger classifier search space, which grew by a factor of 3^11 × 2 / (3^6 × 2) = 3^5 = 243.

The number of links of the neural net rose much more (by a factor of 4.5) than the population size.

Nonetheless, one notices that NEWBOOLE still converges at a faster rate than the neural net.

Figure 4: Comparison between deterministic NEWBOOLE and Back-Propagation on the 11-multiplexer problem (symmetrically smoothed over 1000 cycles).


5.  Conclusion 


In this paper, we presented an improvement to the Boole system; we showed that convergence speed could be drastically improved, and thus that GBML systems can learn far faster than was portrayed by Quinlan in [Quinlan, 88]. Furthermore, the comparison with another incremental learning algorithm, the widely used neural net with back-propagation, showed that NEWBOOLE converges at least as fast. The basic difference between GBML and Connectionist learning thus seems to reside in the fact that Classifier Systems are rule-based systems which provide comprehensible solutions, whereas neural networks merely provide sets of numerical coefficients without any semantic meaning.

Abstract

An  agent  that  must  learn  to  act  in  the  world 
by  trial  and  error  faces  the  reinforcement 
learning  problem,  which  is  quite  different 
from  standard  concept  learning.  Although 
good  algorithms  exist  for  this  problem  in 
the  general  case,  they  are  quite  inefficient. 

One strategy is to find restricted classes of action strategies that can be learned more efficiently. This paper pursues that strategy by developing algorithms that can efficiently learn action maps that are expressible in k-DNF. Both connectionist and classical statistics-based algorithms are presented, then compared empirically on three test problems. Modifications and extensions that will allow the algorithms to work in more complex domains are also discussed.

1  Reinforcement  Learning 

Consider  an  agent  that  must  learn  to  act  in  the  world. 
At  each  moment  in  time,  it  gets  information  about 
the  world  from  its  sensors  and  must  choose  an  action 
to  take.  Having  executed  an  action,  the  agent  gets  a 
signal  from  the  world  that  indicates  how  well  the  agent 
is performing; we shall call this a reinforcement signal.
The  reinforcement  signal  can  be  binary  or  real-valued 
and  it  will  typically  be  noisy. 

This learning scenario is quite different from standard concept learning, in which a teacher presents the learner with a set of input/output pairs. In the reinforcement-learning scenario, the agent must choose an output to generate in response to each input. The reinforcement signal it receives indicates only how successful that output was; it carries no information about how successful other outputs might have been. In addition, the fact that the reinforcement signal is noisy means that each output will have to be generated a number of times in order for the agent to acquire an accurate picture of which is better. In reinforcement-learning situations, an agent may choose an action because it expects it to have good results; however, it may also choose an action in order to gain information about its expected results. The tradeoff between acting to gain reinforcement and acting to gain information makes this problem especially interesting. The formal foundations of reinforcement learning have been widely studied [Kaelbling, 1989b, Kaelbling, 1989a, Narendra and Thathachar, 1989, Berry and Fristedt, 1985, Williams, 1986].

*This work was supported by the Air Force Office of

This paper will focus on a simple case of the reinforcement learning problem in which the following assumptions hold:

•  the agent has only two possible actions

•  the reinforcement signal at time t + 1 reflects only the success of the action taken at time t

•  reinforcement received for performing a particular action in a particular situation is 1 with some probability p and 0 with probability 1 − p, and each trial is independent

•  the expected reinforcement value of doing a particular action in a particular input situation stays constant for the entire run of the learning algorithm

Section 6 discusses the extension of the results in this paper to situations in which each of the above assumptions is relaxed.

2  Complexity  Versus  Efficiency 

There are a number of good algorithms for the reinforcement-learning scenario we are interested in, including learning-automata algorithms [Narendra and Thathachar, 1989], Sutton's reinforcement-comparison methods [Sutton, 1984], and Kaelbling's interval-estimation methods [Kaelbling, forthcoming]. These algorithms were originally developed for the case when the agent has no inputs other than reinforcement and merely needs to decide which action it should take all the time. They can be extended to the case of having many input situations simply by making a copy of the algorithm for each possible input situation. This method works well, but results in algorithms with space complexity proportional, at least, to the number of possible input situations. In addition, no generalization is exhibited; that is, the combined algorithms do not take advantage of the common intuition that "similar" input situations are likely to require "similar" actions.

We can think of agents as learning action maps: mappings from input situations to actions. If an agent must be able to learn action maps of arbitrary complexity, then the methods described above are as good as any. However, if we restrict the class of action maps that we expect an agent to learn, we can invent algorithms for learning those maps that are much more efficient than algorithms for the general case.

A restriction that has proved useful to the concept-learning community is to the class of functions that can be expressed as propositional formulae in k-DNF. A formula is said to be in disjunctive normal form (DNF) if it is syntactically organized into a disjunction of purely conjunctive terms; there is a simple algorithmic method for converting any formula into DNF [Enderton, 1972]. A formula is in the class k-DNF if and only if its representation in DNF contains only conjunctive terms of length k or less. There is no restriction on the number of conjunctive terms, only on their length. Whenever k is less than the number of atoms in the domain, the class k-DNF is a restriction on the class of functions.

Valiant was one of the first to consider the restriction to learning functions expressible in k-DNF [Valiant, 1984, Valiant, 1985]. He developed the following algorithm for learning functions in k-DNF from input-output pairs, which actually uses only the input-output pairs with output 0:

    Let T be the set of conjunctive terms of length k over the set of atoms (corresponding to the input bits) and their negations, and let L be the number of learning instances required to learn the concept to the desired accuracy.¹

    for i := 1 to L do begin
        v := randomly drawn negative instance
        T := T − any term that is satisfied by v
    end

    return T

The algorithm returns the set of terms remaining in T, with the interpretation that their disjunction is the concept that was learned by the algorithm. This method simply examines a fixed number of negative instances and removes any term from T that would have caused one of the negative instances to be satisfied.²
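The elimination procedure can be sketched in a few lines of Python. This is a minimal illustration, not Valiant's own formulation: the term encoding (tuples of `(bit_index, required_value)` pairs), the function names, and the use of an explicit list of negative instances in place of L random draws are all assumptions made for the example.

```python
from itertools import combinations, product

def all_terms(num_bits, k):
    """All conjunctive terms of exactly k literals over distinct bits.
    A term is a tuple of (bit_index, required_value) pairs."""
    return {tuple(zip(bits, values))
            for bits in combinations(range(num_bits), k)
            for values in product((0, 1), repeat=k)}

def satisfies(instance, term):
    """True if every literal of the term holds in the instance."""
    return all(instance[bit] == value for bit, value in term)

def valiant_kdnf(negative_instances, num_bits, k):
    """Remove from T every term satisfied by some negative instance."""
    terms = all_terms(num_bits, k)
    for v in negative_instances:
        terms = {t for t in terms if not satisfies(v, t)}
    return terms

def evaluate(terms, instance):
    """The learned concept is the disjunction of the remaining terms."""
    return int(any(satisfies(instance, t) for t in terms))

# Example: target concept i0 AND i1 over 3 bits, k = 2.
target = lambda x: int(x[0] and x[1])
inputs = list(product((0, 1), repeat=3))
negatives = [x for x in inputs if target(x) == 0]
learned = valiant_kdnf(negatives, num_bits=3, k=2)
print(learned)
```

With every negative instance supplied, only the term i0 ∧ i1 survives in this example; with fewer instances the hypothesis may retain extra terms and so err on the side of outputting 1.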

The following sections describe algorithms for learning action maps in k-DNF from reinforcement and present the results of an empirical comparison of their performance. For each algorithm, the inputs are bit-vectors of length M, plus a distinguished reinforcement bit; the outputs are single bits.

¹This choice is not relevant to our reinforcement-learning scenario; the details are described in Valiant's papers [Valiant, 1984, Valiant, 1985].

²Valiant's presentation of the algorithm defines T to be the set of conjunctive terms of length k or less over the set of atoms and their negations; however, because any term of length less than k can be represented as a disjunction of terms of length k, we use a smaller set T for simplicity in exposition and slightly more efficient computation time.

3  Connectionist Methods for Learning k-DNF

There has been interesting work in the connectionist community on learning from reinforcement, which is relevant to our goals because it focuses on using more efficient algorithms to learn action maps in a restricted class of functions. This section will describe three connectionist methods: a linear reinforcement-comparison method, a multi-layer backpropagation method, and a hybrid method that combines Valiant's algorithm for concept learning with the linear reinforcement-comparison method.

These and other algorithms will be described in a standard form consisting of three components: s_0 is the initial internal state of the algorithm; u(s, i, a, r) is the update function, which takes the state of the algorithm s, the last input i, the last action a, and the reinforcement value received r, and generates a new algorithm state; and e(s, i) is the evaluation function, which takes an algorithm state s and an input i, and generates an action.
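One way to picture this standard form is as a small interface plus a driver loop. The sketch below is an assumption-laden illustration: the names `Learner`, `run`, and `world`, and the choice to draw inputs uniformly at random, are ours and are not part of the paper's notation.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class Learner:
    s0: Any                                                  # initial state s_0
    update: Callable[[Any, Sequence[int], int, int], Any]    # u(s, i, a, r)
    evaluate: Callable[[Any, Sequence[int]], int]            # e(s, i)

def run(learner, world, num_bits, steps, rng):
    """Drive a learner for `steps` ticks; return average reinforcement per tick.
    `world(i, a)` is a hypothetical callback returning reinforcement 0 or 1."""
    state, total = learner.s0, 0
    for _ in range(steps):
        i = [rng.randint(0, 1) for _ in range(num_bits)]  # input situation
        a = learner.evaluate(state, i)                    # choose an action
        r = world(i, a)                                   # receive reinforcement
        state = learner.update(state, i, a, r)            # new algorithm state
        total += r
    return total / steps

# Trivial example: a stateless learner that always emits action 1,
# in a deterministic world that rewards action 1.
always_one = Learner(s0=None,
                     update=lambda s, i, a, r: s,
                     evaluate=lambda s, i: 1)
average = run(always_one, lambda i, a: int(a == 1), 3, 50, random.Random(0))
print(average)
```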

3.1  Linear Reward-Comparison Method

Most of the connectionist methods are simple single-layer algorithms that can learn action maps in the class of linearly separable functions [Widrow et al., 1973, Sutton, 1984, Barto and Anandan, 1985]. Sutton [Sutton, 1984] performed extensive experiments on such methods and found that reinforcement-comparison algorithms tend to have the best performance. The equations below define Algorithm 8 from his dissertation [Sutton, 1984], which uses a version of the Widrow-Hoff or Adaline [Widrow and Hoff, 1960] weight-update algorithm.

The input is represented as an M-dimensional vector i. The internal state, s_0, consists of two M-dimensional vectors, v and w.

    u(s, i, a, r) = let p := \sum_j v_j i_j
                    for j = 1 to M do begin
                        w_j := w_j + \alpha (r - p)(a - 1/2) i_j
                        v_j := v_j + \beta (r - p) i_j
                    end

    e(s, i) = 1  if \sum_j w_j i_j + \nu > 0
              0  otherwise

where α > 0, 0 < β < 1, and ν is a normally distributed random variable of mean 0 and standard deviation σ_ν.

The output, e(s, i), has value 1 or 0 depending on the inner product of w and i and the value of the random variable ν. The addition of the random value causes the algorithm to "experiment" by occasionally performing actions that it would not otherwise have taken. The updating of the vector w is somewhat complicated: each component is incremented by a value with four terms. The first term, α, is a constant that represents the learning rate. The next term, r − p, represents the difference between the actual reinforcement received and the predicted reinforcement, p. This serves to normalize the reinforcement values: the absolute value of the reinforcement signal is not as important as its value relative to the average reinforcement that the agent has been receiving. The predicted reinforcement, p, is generated using a standard linear associator that learns to associate input vectors with reinforcement values by setting the weights in vector v. The third term in the update function for w is a − 1/2: it has constant absolute value and the sign is used to encode which action was taken. The final term is i_j, which causes the jth component of the weight vector to be adjusted in proportion to the jth value of the input.

The space required for the state, as well as the time for both update and evaluation operations, is O(M), where M is the number of input bits.
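The update and evaluation rules described in this subsection can be written compactly in Python. The sketch below is illustrative rather than a reproduction of Sutton's code: the class name, the zero initialization of v and w, and the use of `random.gauss` for the noise term are assumptions.

```python
import random

class LinearReinforcementComparison:
    """Sketch of a linear reinforcement-comparison learner.
    Assumption: both weight vectors start at zero."""

    def __init__(self, num_bits, alpha, beta, sigma, rng=None):
        self.w = [0.0] * num_bits   # action weights
        self.v = [0.0] * num_bits   # reinforcement-prediction weights
        self.alpha, self.beta, self.sigma = alpha, beta, sigma
        self.rng = rng or random.Random()

    def evaluate(self, i):
        """Emit 1 if the noisy inner product of w and i is positive, else 0."""
        noise = self.rng.gauss(0.0, self.sigma)
        return 1 if sum(w * x for w, x in zip(self.w, i)) + noise > 0 else 0

    def update(self, i, a, r):
        """Adjust w and v using the prediction p made before the update."""
        p = sum(v * x for v, x in zip(self.v, i))   # predicted reinforcement
        for j, x in enumerate(i):
            self.w[j] += self.alpha * (r - p) * (a - 0.5) * x
            self.v[j] += self.beta * (r - p) * x

# Two hand-checkable updates with the noise switched off (sigma = 0).
learner = LinearReinforcementComparison(2, alpha=1.0, beta=0.5, sigma=0.0)
learner.update([1, 0], a=1, r=1)   # p = 0, so w[0] rises by 0.5, v[0] by 0.5
learner.update([1, 0], a=0, r=0)   # p = 0.5, so w[0] rises by 0.25, v[0] drops by 0.25
print(learner.w, learner.v)
```

Note that failing with action 0 pushes w[0] upward, toward action 1: the sign of (r − p)(a − 1/2) is positive both when action 1 does better than predicted and when action 0 does worse.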

3.2  Multi-layer  Back-propagation  Method 

Error back-propagation is a method for training connectionist networks that are comprised of multiple layers. Anderson [Anderson, 1986] has designed a connectionist system with multiple layers that uses backpropagation as a method for learning from reinforcement.

Anderson's system uses two networks: one for learning to predict reinforcement and one for learning which action to take. Each of these is a two-layer network, with all of the hidden units connected to all of the inputs and all of the inputs and hidden units connected to the outputs. The system was designed to work in worlds with delayed reinforcement (which are discussed here at greater length in Section 6), but it is easily modified to work in our simpler domain. This algorithm is rather complex, so space does not allow it to be described further. A clear description can be found in Anderson's dissertation [Anderson, 1986].

This method is theoretically able to learn very complex functions, but tends to require many training instances before it converges. The time and space complexity for this algorithm is O(MH), where M is the number of input bits and H is the number of hidden units.

3.3  A  Hybrid  Algorithm 

Given our interest in restricted classes of functions, we can construct a new hybrid algorithm for learning action maps in k-DNF. It hinges on the simple observation that any such function can be expressed as a linear combination of terms in the set T, where T is the set of conjunctive terms of length k over the set of atoms (corresponding to the input bits) and their negations. It is possible to take the original M-bit input signal and transduce it to a wider signal that is the result of evaluating each member of T on the original inputs. We can use this new signal as input to a relatively simple connectionist learning algorithm, such as the one described in Section 3.1 above.

If there are M input bits, the set T has size C(2M, k) because we are choosing from the set of bits and their negations. However, we can eliminate all elements that contain both an atom and its negation, yielding a set of size 2^k C(M, k). The space required by the algorithm, as well as the time to update the internal state or to evaluate an input instance, is proportional to the size of T, and thus O(M^k). It is important to note that this algorithm (as well as the other three discussed in this paper) is strictly incremental: its time and space requirements depend only on the size of the input and on the fixed parameter k and do not increase over the course of a run.
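The transduction step and the size of T can be illustrated as follows. This is a sketch under stated assumptions: terms are encoded as tuples of `(bit_index, required_value)` pairs, and the function names are hypothetical.

```python
from itertools import combinations, product
from math import comb

def conjunctive_terms(num_bits, k):
    """Terms of exactly k literals with no atom paired with its own negation."""
    return [tuple(zip(bits, values))
            for bits in combinations(range(num_bits), k)
            for values in product((0, 1), repeat=k)]

def transduce(instance, terms):
    """Widen an M-bit input into one bit per term: 1 if the term is satisfied."""
    return [int(all(instance[b] == v for b, v in term)) for term in terms]

M, k = 3, 2
terms = conjunctive_terms(M, k)
features = transduce([1, 1, 0], terms)

print(len(terms), comb(2 * M, k), sum(features))
```

For M = 3 and k = 2 the pruned set has 2^k C(M, k) = 12 terms, against C(2M, k) = 15 before terms containing an atom and its negation are removed. Every input satisfies exactly C(M, k) of the terms, one per choice of k bits. The widened vector could then be fed to a linear learner such as the one in Section 3.1.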

4  Interval-Estimation Algorithm for k-DNF

The interval-estimation algorithm for k-DNF is, like the hybrid algorithm described in Section 3.3, based on Valiant's algorithm, but the interval-estimation algorithm uses standard statistical estimation methods rather than connectionist weight-adjustments. The technique of interval-estimation has also been applied to other reinforcement-learning problems [Kaelbling, forthcoming].

4.1  General  Description 

This section will describe the algorithm independent of particular statistical tests, which will be introduced in the next section. We shall need the following definitions, however. An input bit-vector satisfies a term whenever all the bits mentioned positively in the term have value 1 in the input and all the bits mentioned negatively in the term have value 0 in the input. The quantity er(t, a) is the expected value of the reinforcement that the agent will gain, per trial, if it generates action a whenever term t is satisfied by the input and action ¬a otherwise. The quantity ubr_α(t, a) is the upper bound of a 100(1 − α)% confidence interval on the expected reinforcement gained from performing action a whenever term t is satisfied by the input. We can now give the formal definition of the algorithm:

    s_0 = the set T, with a collection of statistics associated with each member of the set

    e(s, i) = for each t in s
                  if i satisfies t and
                     ubr_α(t, 1) > ubr_α(t, 0) and
                     Pr(er(t, 1) = er(t, 0)) < β
                  then return 1
              return 0

    u(s, i, a, r) = for each t in s
                        update_term_statistics(t, i, a, r)
                    return s

At any moment in the operation of this algorithm, we can extract a symbolic description of its current hypothesis. It is the disjunction of all terms t such that ubr_α(t, 1) > ubr_α(t, 0) and Pr(er(t, 1) = er(t, 0)) < β. This is the k-DNF expression according to which the agent is choosing its actions.

The evaluation criterion is chosen in such a way as to make the important trade-off between acting to gain information and acting to gain reinforcement. A naive method would be for each term to generate a 1 whenever action 1 has had a higher success rate than action 0. This would be a very bad strategy, however, because if the first trial of action 0 failed, its success rate would be 0, causing action 0 never to be chosen again. The interval estimation method works because of the fact that the value of ubr_α can be high for two reasons. It may be high because the confidence interval is very large due to the action not having been tried very often; this will cause the action to be chosen in order to gain information. The upper bound may also be high because the confidence interval is small and the action has a genuinely high payoff; this will cause an action to be chosen in order to gain reinforcement. At the beginning of a course of execution of this algorithm, actions are chosen almost at random, until the upper bound of the worse action is driven down by sampling, while the upper bound of the other stays high. The value of α determines the size of the confidence interval; when it is small the confidence interval is large and the algorithm is very conservative. It is not likely to converge to the wrong action, but it may take a long time to converge. As α is increased, the confidence intervals become smaller, the learning rate faster, and the chance of gross error higher.

Let the equivalence probability of a term be the probability that the expected reinforcement is the same no matter what choice of action is made when the term is satisfied. The second requirement for a term to cause a 1 to be emitted is that the equivalence probability be small. Without this criterion, terms for which no action is better will, roughly, alternate between choosing action 1 and action 0. Because the output of the entire algorithm will be 1 whenever any term has the value 1, this alternation of values can cause a large number of wrong answers. Thus, if we can convince ourselves that a term is irrelevant by showing that its choice of action makes no difference, we can safely ignore it.
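The evaluation and update rules can be sketched with the statistical tests left as parameters, matching the separation made in this section. Everything concrete below is an assumption introduced for illustration: the `TermStats` record, the Wilson score upper bound used as a stand-in for the confidence bound, the pooled two-proportion z statistic used as a stand-in for the equality test, and the threshold values. The minimum-trials modification is omitted.

```python
from dataclasses import dataclass, field
from math import sqrt

@dataclass
class TermStats:
    term: tuple                                      # ((bit_index, value), ...)
    n: list = field(default_factory=lambda: [0, 0])  # trials of action 0, 1
    s: list = field(default_factory=lambda: [0, 0])  # successes of action 0, 1

def satisfies(i, term):
    return all(i[b] == v for b, v in term)

def evaluate(state, i, ubr, differs):
    """e(s, i): emit 1 if some satisfied term favours action 1 and the two
    actions are judged to differ; otherwise emit 0."""
    for t in state:
        if satisfies(i, t.term) and ubr(t, 1) > ubr(t, 0) and differs(t):
            return 1
    return 0

def update(state, i, a, r):
    """u(s, i, a, r): update statistics only for terms satisfied by i."""
    for t in state:
        if satisfies(i, t.term):
            t.n[a] += 1
            t.s[a] += r
    return state

# Stand-in statistical tests (assumptions, not taken from the paper).
def wilson_upper(s, n, z):
    """Wilson score upper confidence bound for a Bernoulli parameter."""
    if n == 0:
        return 1.0                       # untried action: maximally optimistic
    p = s / n
    return (p + z*z/(2*n) + z*sqrt(p*(1 - p)/n + z*z/(4*n*n))) / (1 + z*z/n)

def pooled_z(t):
    """Pooled two-proportion z statistic for action 0 versus action 1."""
    (n0, n1), (s0, s1) = t.n, t.s
    if n0 == 0 or n1 == 0:
        return 0.0
    pooled = (s0 + s1) / (n0 + n1)
    var = pooled * (1 - pooled) * (n0 + n1) / (n0 * n1)
    return 0.0 if var == 0 else (s0 / n0 - s1 / n1) / sqrt(var)

Z_ALPHA = Z_BETA = 2.0                   # hypothetical thresholds
ubr = lambda t, a: wilson_upper(t.s[a], t.n[a], Z_ALPHA)
differs = lambda t: abs(pooled_z(t)) >= Z_BETA

state = [TermStats(term=((0, 1),))]      # single term: "bit 0 is 1"
fresh_output = evaluate(state, [1], ubr, differs)
for _ in range(20):
    update(state, [1], 1, 1)             # action 1 always succeeds
    update(state, [1], 0, 0)             # action 0 always fails
trained_output = evaluate(state, [1], ubr, differs)
print(fresh_output, trained_output)
```

Passing `ubr` and `differs` as callables keeps the control structure independent of any particular test, so other bounds or significance tests can be substituted without touching `evaluate` or `update`.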


4.2  Statistics

In the simple reinforcement-learning scenario we are considering, the necessary statistical tests are also quite simple. For each term, we store the following statistics: n_0, the number of trials of action 0; s_0, the number of successes of action 0; n_1, the number of trials of action 1; and s_1, the number of successes of action 1. These statistics are incremented only when the associated term is satisfied by the current input instance.

If n is the number of trials and s the number of successes arising from a series of Bernoulli trials with success probability p, the upper bound of a 100(1 − α) percent confidence interval for p can be approximated by a function h(s, n, α) of s, n, and z_{α/2}, where z_{α/2} is such that P(Z ≥ z_{α/2}) = P(Z ≤ −z_{α/2}) = α/2 when Z is a standard normal random variable [Larsen and Marx, 1986]. This allows us to define ubr_α(t, 0) as h(s_0, n_0, α) and ubr_α(t, 1) as h(s_1, n_1, α), where s_0, n_0, s_1, and n_1 are the statistics associated with term t.

To test for equality of the underlying Bernoulli parameters, we use a two-sided test at the β level of significance that rejects the hypothesis that the parameters are equal whenever

\[
\frac{\dfrac{s_0}{n_0} - \dfrac{s_1}{n_1}}
     {\sqrt{\dfrac{s_0 + s_1}{n_0 + n_1}\left(1 - \dfrac{s_0 + s_1}{n_0 + n_1}\right)\dfrac{n_0 + n_1}{n_0 n_1}}}
\quad \text{is either} \quad \le -z_{\beta/2} \ \text{ or } \ \ge +z_{\beta/2},
\]

where z_{β/2} is a standard normal deviate [Larsen and Marx, 1986]. Because sample size is important for this test, the algorithm is slightly modified to ensure that, at the beginning of a run, each action is chosen a minimum number of times, referred to by the parameter β_min.

The complexity of this algorithm is of the same order as that of the hybrid connectionist algorithm of Section 3.3, namely O(M^k).

5  Empirical Comparison

This section reports the results of a set of experiments designed to compare the performance of the algorithms discussed in this paper.

5.1  Algorithms and Environments

The following algorithms were tested in these experiments:

•  LINCONN  Linear reinforcement-comparison algorithm

•  LINCONN+  Linear reinforcement-comparison with an extra input wired to have a constant value

•  CONNKDNF  Hybrid connectionist algorithm for k-DNF

•  IEKDNF  Interval-estimation algorithm for k-DNF

•  BP  Anderson's error back-propagation algorithm

•  IE  Basic interval-estimation algorithm

The basic interval-estimation algorithm IE [Kaelbling, forthcoming] is included as a yardstick; it is computationally much more complex than the other algorithms and will very likely out-perform them.

Each of the algorithms was tested in three different environments. The environments are called binomial Boolean expression worlds and can be characterized by the following parameters: M, expr, p_1s, p_1n, p_0s, and p_0n. The parameter M is the number of input bits, expr is a Boolean expression over the input bits, p_1s is the probability of receiving reinforcement value 1 given that action 1 is taken when the input instance satisfies expr, p_1n is the probability of receiving reinforcement value 1 given that action 1 is taken when the input instance does not satisfy expr, p_0s is the probability of receiving reinforcement value 1 given that action 0 is taken when the input instance satisfies expr, and p_0n is the probability of receiving reinforcement value 1 given that action 0 is taken when the input instance does not satisfy expr. Input vectors are chosen by the world according to a uniform probability distribution.

    Task    M    p_1s    p_1n    p_0s    p_0n
    1       3    .6      .4      .4      .6
    2       3    .9      .1      .1      .9
    3       6    .9      .5      .6      .8

Table 1: Parameters of test environments for k-DNF experiments.

Table 1 shows the values of these parameters for each task. The first task has the simple, linearly separable expression (i_0 ∧ i_1) ∨ (i_1 ∧ i_2); what makes it difficult is the small separation between the reinforcement probabilities. Task 2 has highly differentiated reinforcement probabilities, but the function to be learned, (i_0 ∧ ¬i_1) ∨ (i_1 ∧ ¬i_2) ∨ (i_2 ∧ ¬i_0), is a complex exclusive-or. Finally, Task 3 is the simple conjunctive function i_2 ∧ ¬i_5, but all of the reinforcement probabilities are high and there are 6 input bits rather than only 3.
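As a consistency check on these environment definitions, the expected reinforcement per tick of a purely random policy and of the optimal policy can be computed by enumerating all inputs. The sketch below is illustrative: the function name is hypothetical, and the Boolean expressions are our reading of the task descriptions (for Task 3 only the fact that the expression is a two-literal conjunction affects the result).

```python
from itertools import product

def expected_reinforcement(num_bits, expr, p1s, p1n, p0s, p0n):
    """Return (random, optimal) expected reinforcement per tick,
    assuming inputs are drawn uniformly from all bit-vectors."""
    random_total = optimal_total = 0.0
    for bits in product((0, 1), repeat=num_bits):
        p1, p0 = (p1s, p0s) if expr(bits) else (p1n, p0n)
        random_total += (p1 + p0) / 2     # actions 0 and 1 equally likely
        optimal_total += max(p1, p0)      # always take the better action
    count = 2 ** num_bits
    return random_total / count, optimal_total / count

tasks = {
    1: (3, lambda i: (i[0] and i[1]) or (i[1] and i[2]), .6, .4, .4, .6),
    2: (3, lambda i: (i[0] and not i[1]) or (i[1] and not i[2])
                     or (i[2] and not i[0]), .9, .1, .1, .9),
    3: (6, lambda i: i[2] and not i[5], .9, .5, .6, .8),
}
results = {task: expected_reinforcement(*params) for task, params in tasks.items()}
for task, (rand, opt) in results.items():
    print(task, round(rand, 4), round(opt, 4))
```

Under these assumptions the values come out as .5 and .6 for Task 1, .5 and .9 for Task 2, and .675 and .825 for Task 3, which agree with the "random" and "optimal" rows of Table 3.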

5.2  Parameter  Tuning 

Each of the algorithms has a set of parameters. For both IEKDNF and CONNKDNF, k = 2. The simple connectionist algorithms LINCONN and LINCONN+ as well as CONNKDNF have parameters α, β, and σ. Following Sutton [Sutton, 1984], parameters β and σ in CONNKDNF, LINCONN, and LINCONN+ will be fixed to have values .1 and .3, respectively. The IEKDNF algorithm has two confidence-interval parameters, z_{α/2} and z_{β/2}, and a minimum age for the equality test, β_min, while the IE algorithm has only z_{α/2}. Finally, the BP algorithm has a large set of parameters: β, learning rate of the evaluation output units; β_h, learning rate of the evaluation hidden units; ρ, learning rate of the action output units; and ρ_h, learning rate of the action hidden units. In each of the tasks, the BP algorithm had as many hidden units as inputs.

All of the parameters for each algorithm were chosen to optimize the behavior of that algorithm on the chosen task. The success of an algorithm was measured by the average reinforcement received per tick, averaged over the entire run. For each algorithm and environment, a series of 100 trials of length 3000 was run with different parameter values. Table 2 shows the best set of parameter values found for each algorithm-environment pair.

5.3  Results 

Having chosen the best parameter values for each algorithm and environment, we compared the performance of the algorithms on runs of length 3000 using the parameter settings of Table 2. The performance metric was average reinforcement per tick, averaged over the entire run. The results are shown in Table 3, together with the expected reinforcement of executing a completely random behavior (choosing actions 0 and 1 with equal probability) and of executing the optimal behavior. These results do not tell the entire story, however. It is important to test for statistical significance to be relatively sure that the ordering of one algorithm over another did not arise by chance. Figures 1, 2 and 3 show, for each task, a pictorial representation of the results of a 1-sided t-test applied to each pair of experimental results. The graphs encode a partial order of significant dominance, with solid lines representing significance at the .95 level and dashed lines representing significance at the .85 level.

    ALG-TASK                1        2        3
    LINCONN     α           .0625    .125     .125
    LINCONN+    α           .125     .0625    .25
    CONNKDNF    α           .125     .25      .03125
    IEKDNF      z_{α/2}     3        3.5      2.5
                z_{β/2}     1        2.5      3.5
                β_min       15       5        25
    BP          β           .1       .25      .1
                β_h         .2       .3       .05
                ρ           .15      .15      .35
                ρ_h         .2       .05      .1
    IE          z_{α/2}     3.0      1.5      2.5

Table 2: Best parameter values for each k-DNF algorithm in each environment.

    ALG-TASK    1        2        3
    LINCONN     .5329    .7418    .7769
    LINCONN+    .5456    .7459    .7722
    CONNKDNF    .5783    .8903    .7825
    IEKDNF      .5789    .8900    .7993
    BP          .5456    .7406    .7852
    IE          .5827    .8966    .7872
    random      .5000    .5000    .6750
    optimal     .6000    .9000    .8250

Table 3: Average reinforcement for k-DNF problems over 100 runs of length 3000.

Figure 1: Significant dominance partial order among k-DNF algorithms for Task 1.

Figure 2: Significant dominance partial order among k-DNF algorithms for Task 2.

With the best parameter values for each algorithm, it is also of some interest to compare the rate at which performance improves as a function of the number of training instances. Figures 4, 5, and 6 show superimposed plots of the learning curves for each of the algorithms. Each point represents the average reinforcement received over a sequence of 100 steps, averaged over 100 runs of length 3000.

5.4  Discussion 

On Tasks 1 and 2 the basic interval-estimation algorithm,
IE, performed significantly better than any of the other
algorithms. The magnitude of its superiority, however, is
not extremely great: Figures 4 and 5 reveal that the IEKDNF
and CONNKDNF algorithms have similar performance
characteristics both to each other and to IE. On these two
tasks, the overall performance of IEKDNF and CONNKDNF was
not found to be significantly different.

The backpropagation algorithm, BP, performed considerably
worse than expected on Tasks 1 and 2. It is very difficult
to tune the parameters for this algorithm, so its bad
performance may be explained by a sub-optimal setting of
parameters.* However, it is possible to see in the learning
curves of Figures 4 and 5 that the performance of BP was
still increasing at the ends of the runs. This may indicate
that with more training instances it would eventually
converge to optimal performance.

Figure 3: Significant dominance partial order among
k-DNF algorithms for Task 3.

Figure 4: Learning curves for Task 1.

Figure 5: Learning curves for Task 2.

Figure 6: Learning curves for Task 3.

The  simple  linear  connectionist  algorithms  per¬ 
formed  poorly  on  both  Tasks  1  and  2.  This  poor  per¬ 
formance  was  expected  on  Task  2,  because  such  algo¬ 
rithms  are  known  to  be  unable  to  learn  non-linearly- 
separable  functions.  Task  1  is  difficult  for  these  al¬ 
gorithms  because,  during  the  execution  of  the  algo¬ 
rithm,  the  evaluation  function  is  often  too  complex  to 
be  learned  by  the  simple  linear  associator.  Adding  a 
constant  input  value  to  the  simple  linear  connectionist 
algorithm  made  a  significant  improvement  in  perfor¬ 
mance;  this  is  not  surprising,  because  it  allows  dis¬ 
crimination  hyperplanes  that  do  not  pass  through  the 
origin  of  the  space  to  be  found. 

Task 3 reveals many interesting strengths and weaknesses
of the algorithms. One of the most interesting is that IE
is no longer the best performer. Because the target
function is simple and there is a larger number of input
bits, the ability to generalize across input instances
becomes important. The IEKDNF algorithm is able to find the
correct hypothesis early during the run (this is apparent
in the learning curve of Figure 6). However, because the
reinforcement values are not highly differentiated and
because the size of the set T is quite large, it begins to
include extraneous terms due to statistical fluctuations in
the environment, causing slightly degraded performance.

The IE, BP, and CONNKDNF algorithms all have very similar
performance on Task 3, with the linear connectionist
algorithms performing slightly worse, but still reasonably
well.

6  Relaxing  the  Assumptions 

This section will discuss the consequences of relaxing the
assumptions made at the beginning of this paper, especially
in the context of the two better-performing algorithms,
IEKDNF and CONNKDNF. In some cases, simple changes can be
made to the algorithms that will allow them to work in the
more general situations. In others, there are theoretical
problems that make extensions difficult. Each of the
concrete extensions proposed to the IEKDNF algorithm has
been implemented and tested.

Thus far we have assumed that the agent has only two
possible actions. Many of the early learning-automata
algorithms are directly applicable to problems with more
than two actions. It has also been shown [Kaelbling,
forthcoming] that the problem of generating actions
specified by N output bits can be solved by N
interconnected modules that learn to generate one output
bit from reinforcement. Thus, the algorithms presented here
could be applied, using this method, to problems with many
possible outputs.

*In the parameter tuning phase, the parameters were varied
independently; it may well be necessary to perform
gradient-ascent search in the parameter space, but that is
a computationally difficult task, especially when the
evaluation of any point in parameter space may have a high
degree of noise.

The  problem  of  delayed  reinforcement  has  been 
addressed  by  Sutton  [Sutton,  1988]  and  Watkins 
[Watkins,  1989],  among  others.  Sutton’s  solution, 
called  the  temporal  difference  method  (TD)  can  be 
abstracted  away  from  the  particular  reinforcement¬ 
learning  mechanism  being  used.  It  provides  a  mod¬ 
ule  that  learns  to  transduce  the  delayed  reinforcement 
signal  that  is  coming  from  the  world  into  an  immedi¬ 
ate  reinforcement  signal  that  evaluates  each  state  of 
the  world  to  be  the  expected  future  reward  based  on 
the  agent’s  current  strategy.  Because  this  local  rein¬ 
forcement  signal  must  be  learned,  using  a  TD  module 
violates  a  different  one  of  our  assumptions:  that  the 
expected  reinforcement  of  performing  an  action  in  a 
situation  be  fixed  over  the  course  of  a  run.  This  will 
be  addressed  below. 

If the reinforcement provided by the world cannot be
modeled as independent trials of some sort, then it is very
difficult to use explicit statistical methods. The
connectionist algorithms are implicitly statistical and
would also have trouble in such worlds. However, if the
trials are independent, we have a variety of different
statistical models available. The CONNKDNF algorithm, as
presented, can be used when the reinforcement is
real-valued. The IEKDNF algorithm can be implemented with
different statistical tests. For instance, if we know that
the reinforcement values for each input-action pair are
normally distributed, we can use standard statistical
methods to construct confidence intervals and to test for
equality of means. If we have no model, we can use
non-parametric methods.

Finally, we consider the case of having the expected
reinforcement of performing an action in a situation change
during the course of a run. The CONNKDNF algorithm will
work in such cases, although it might be necessary to
adjust its parameters. The statistically-based IEKDNF
algorithm can be modified to work, by causing its
statistics to decay over time. If an action has not been
tried for a long time, its n value will slowly decay, which
will cause its confidence interval to grow larger.
Eventually it will grow large enough for that action to be
chosen again. If the action has good results, the policy
will be changed to favor this action.
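
The decay mechanism can be illustrated with a small sketch. It is not the implementation tested in the paper: the class name, the decay factor, and the form of the upper bound (a normal-approximation interval with worst-case variance 0.25 for reinforcement in [0, 1]) are all assumptions chosen to keep the example short.

```python
import math


class DecayingActionStats:
    """Per-action statistics whose effective sample size shrinks over time."""

    def __init__(self, decay=0.99):
        self.decay = decay      # hypothetical decay factor applied every step
        self.n = 0.0            # effective number of trials
        self.successes = 0.0    # effective sum of reinforcement received

    def tick(self):
        """Age the statistics by one time step, whether or not the action ran."""
        self.n *= self.decay
        self.successes *= self.decay

    def record(self, reinforcement):
        """Record the reinforcement (0 or 1) from one trial of this action."""
        self.n += 1.0
        self.successes += reinforcement

    def upper_bound(self, z=2.0):
        """Upper end of an approximate confidence interval on expected reinforcement."""
        if self.n <= 0.0:
            return 1.0
        p = self.successes / self.n
        # Worst-case standard deviation 0.5 (an assumption of this sketch).
        return min(1.0, p + z * 0.5 / math.sqrt(self.n))
```

Because `tick` scales `n` and `successes` together, the estimated mean is unchanged while the interval widens, so a neglected action eventually looks worth trying again.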

7  Conclusion 

From this study, we can see that it is useful to design
algorithms that are tailored to learning certain restricted
classes of functions. The two specially-designed algorithms
out-performed standard methods of comparable complexity.
The CONNKDNF and IEKDNF algorithms each have their
strengths and weaknesses. It is possible that CONNKDNF may
out-perform IEKDNF to some extent because in CONNKDNF each
term gets to contribute to the answer with different
degrees. This avoids errors that occur in IEKDNF when a
single term is barely over the threshold for generating
a 1. On the other hand, the state of IEKDNF has internal
semantics that are clear and directly interpretable in the
language of classical statistics. This simplifies the
process of extending the algorithm to apply to other types
of worlds in a principled manner.

Important  future  work  will  be  to  identify  other  re¬ 
stricted  classes  of  functions  that  can  be  learned  effi¬ 
ciently  and  effectively  from  reinforcement  and  demon¬ 
strate  that  these  classes  contain  functions  that  solve  in¬ 
teresting  and  important  problems  from  the  real  world. 
Abstract

The  most  frequently  used  measure  of  per¬ 
formance  for  reinforcement  learning  algo¬ 
rithms  is  learning  rate.  That  is,  how  many 
learning  trials  are  required  before  the  pro¬ 
gram  is  able  to  perform  its  task  adequately. 

In  this  paper,  we  argue  that  this  is  not 
necessarily  the  best  measure  of  perform¬ 
ance  and,  in  some  cases,  can  even  be  mis¬ 
leading.  In  control  tasks,  such  as  pole  bal¬ 
ancing,  we  have  found  that  a  program  that 
learns  to  balance  the  pole  quickly  produces 
a  control  strategy  that  is  so  specific  as  to 
make  it  impossible  to  transfer  expertise 
from  one  related  task  to  another.  We  exam¬ 
ine  the  reasons  for  this  and  suggest  ways  of 
obtaining  general  control  strategies.  We 
also  make  the  conjecture  that,  as  a  broad 
principle,  there  is  a  trade-off  between  rapid 
learning  rate  and  the  ability  to  generalise. 

We  also  introduce  methods  for  analysing 
the  results  of  reinforcement  learning  algo¬ 
rithms  to  produce  readable  control  rules. 

1.  Introduction 

The  most  frequently  used  measure  of  performance 
for  reinforcement  learning  algorithms  such  as  connec- 
tionist  and  genetic  algorithms  is  learning  rate.  That  is, 
how  many  learning  trials  are  required  before  the  pro¬ 
gram  is  able  to  perform  its  task  adequately.  In  this 
paper,  we  argue  that  this  is  not  necessarily  the  best 
measure  of  performance  and,  in  some  cases,  can  even 
be  misleading.  We  illustrate  our  argument  with  a  spe¬ 
cific  learning  task,  namely,  pole  balancing. 

We are especially interested in learning control tasks (or
skill acquisition) and one of our primary goals is to be
able to create a controller capable of dealing with a broad
range of problems of the same type. In the course of
performing experiments with a variety of learning
algorithms (Sammut, 1988) we discovered that, while these
algorithms could learn to balance a pole, the control
strategy that had been acquired could not be frozen and
transferred to another attempt at balancing a pole. In
other words, the control strategy that had been learned was
very specific. Furthermore, the more rapidly the program
learnt to control the pole, the more specific was its
control strategy. We will give reasons why this should be
so and suggest ways of creating more general control
strategies.

The first method, called “voting”, requires the learning
task to be repeated a number of times; the results of these
learning runs are collected into a single controller that
is very robust. We then examine how the results of this
style of reinforcement learning can be turned into general
and readable control rules.

All of the experiments described here use the BOXES
algorithm (Michie and Chambers, 1968) as a starting point;
however, we believe that our conclusions are applicable to
most reinforcement learning algorithms, including
connectionist and genetic algorithms. Indeed, we make the
conjecture that, as a broad principle, there is a trade-off
between rapid learning rate and the ability to generalise.
This was characterised by Michie and Chambers as
“exploration vs. exploitation”.

In the following section we describe why rapid learning
rates do not lead to general solutions. The subsequent
sections describe a number of experiments aimed at building
more general controllers.

2. The Problem

The pole balancing problem will only be described very
briefly since it has been the subject of a number of
previous papers and we are more interested in looking at
the results of the learning algorithms and what to do after
they have done their job. Sammut (1988) gives a survey of
previous work. The pole and cart problem can be stated as
follows:

Figure 1: The 162 boxes displayed as a four-dimensional array

A rigid pole is hinged to a cart, which is free
to move within the limits of a track. The
learning system attempts to keep the pole
balanced and the cart within its limits by
applying a force of fixed magnitude to the cart,
either to the left or to the right. (Selfridge,
Sutton and Barto, 1985)


Four parameters describe the state of the pole and cart at
any instant in time. They are: the position of the cart on
the track (x), the velocity of the cart (ẋ), the angle of
the pole (θ), and the angular velocity of the pole (θ̇).
These parameters contain sufficient information to allow a
controller to correctly decide whether to push left or
right.

The problem space can be thought of as a four dimensional
space, where each dimension is defined by one of the four
state variables. Translating this to a computational
structure creates a four dimensional look-up table. Given
infinite storage, each point in the space contains a
boolean value which tells the system to push left at that
point or to push right. A practical approximation to this
approach is to partition the state space so that
neighbouring points within a region are mapped to the same
decision (Michie and Chambers, 1968). In the experiments
reported here, all the learning algorithms were tested
against a simulation of the pole and cart. The simulation
uses the equations of motion as described by Anderson
(1987). We use the same partitions as Selfridge, Sutton and
Barto (1985):

cart position:      ± 0.8, ± 2.4 metres
cart velocity:      ± 0.5, ± ∞ metres/sec
pole angle:         0, ± 1, ± 6, ± 12 degrees
angular velocity:   ± 50, ± ∞ degrees/sec

This creates 162 “boxes” that fill the problem space. After
one time step in the simulation, the system will land in
one box. The next move is determined by the action setting
of that box, i.e. push left or push right. We will display
the action settings in the set of boxes as an array of
zeros and ones as in Figure 1.

If  the  pole  and  cart  system  falls  in  a  box  containing  a 
zero  then  the  next  control  action  is  to  push  left.  If  it  falls 
in  a  box  containing  a  one  then  the  control  action  is  to 
push  right.  Other  pole  balancing  algorithms  need  not 
represent  the  state  space  as  explicitly  as  we  do  here. 
However,  later  we  will  argue  that  the  conclusions 
drawn  from  these  experiments  are  applicable  to  other 
systems. 
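
As an illustration of how a state is mapped to one of the 162 boxes, the following sketch uses the partition values listed above (3 position regions, 3 velocity regions, 6 angle regions and 3 angular-velocity regions). The function name, the ordering of the boxes and the treatment of values that fall exactly on a threshold are assumptions of the sketch, not details taken from BOXES.

```python
from bisect import bisect_right

# Interior thresholds for each state variable (angles in degrees).
X_CUTS = [-0.8, 0.8]
X_DOT_CUTS = [-0.5, 0.5]
THETA_CUTS = [-6.0, -1.0, 0.0, 1.0, 6.0]
THETA_DOT_CUTS = [-50.0, 50.0]


def box_index(x, x_dot, theta, theta_dot):
    """Return a box number in 0..161, or None if the state is a failure."""
    if abs(x) > 2.4 or abs(theta) > 12.0:
        return None  # cart off the track or pole fallen
    i = bisect_right(X_CUTS, x)                  # 0..2
    j = bisect_right(X_DOT_CUTS, x_dot)          # 0..2
    k = bisect_right(THETA_CUTS, theta)          # 0..5
    l = bisect_right(THETA_DOT_CUTS, theta_dot)  # 0..2
    return ((i * 3 + j) * 6 + k) * 3 + l


def action_for(boxes, x, x_dot, theta, theta_dot):
    """Look up the action (0 = push left, 1 = push right) for a state."""
    index = box_index(x, x_dot, theta, theta_dot)
    return None if index is None else boxes[index]
```

Here `boxes` is simply a list of 162 zeros and ones, which is one possible flat encoding of the array shown in Figure 1.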

Suppose we allow the program to learn how to control the
pole and cart system by conducting a number of trials, i.e.
attempts at balancing the pole, until it finally succeeds.
Let us then freeze the learning element and use the control
strategy that has been learnt on a new trial where the pole
and cart are placed in a random, but recoverable, initial
state. This can be repeated a number of times to obtain the
average length of time that the controller is able to avoid
failure. All of the learning algorithms tested by Sammut
(1988), including BOXES (Michie and Chambers, 1968), the
AHC algorithm (Sutton, 1984) and a genetic algorithm
(Odetayo, 1988), could not successfully balance the pole
again even for a short time. To see the reason for this,
let us look at another way of representing the problem.
Imagine that each box represents one node in a
non-deterministic state transition diagram as in Figure 2.
On entering a node, the program must choose whether to push
left or right, i.e., exit the node on a zero transition or
a one.

Figure 2: State transition representation of pole and cart problem

Where possible, the program should choose a transition that
will put the system into a cycle since a cycle indicates
that the pole and cart are avoiding the failure nodes
(shown as heavy circles). The dotted lines in the diagram
indicate a possible cycle. Note that choosing a 0 (left
push) or 1 (right push) does not uniquely determine which
node the system will enter since non-determinism has been
introduced by approximating each point in the state space
by a region or box. For example, the node marked ‘X’ in
Figure 2 has two outgoing edges labelled ‘1’. If the system
were to land in this node and the action in ‘X’ is a right
push, then we cannot predict which of the two neighbouring
nodes reachable by the ‘1’ edges would be entered.


When a program learns to balance the pole quickly, that
means that it has been able to rapidly find a cycle in
which to keep the system and, because the success criterion
is keeping the pole balanced, the program does not explore
other parts of the graph. Looking at the array of boxes in
Figure 1, one can imagine that the program learns reliable
settings for a small subset of boxes and if the pole and
cart system can be kept within that reliable set of boxes
then the pole will remain balanced. Since very little of
the problem space has been explored, if the system is
restarted with different initial parameters, then the
program will be lacking in expertise and fail to keep the
pole balanced.

While this effect can be observed using the BOXES
representation, we claim that this also applies to other
representations. For example, if a connectionist system is
used and it has a number of hidden units and inputs
directly from the four state variables, the learning
program must explore a large space of settings for weights
before settling on a solution. If the sole criterion for
success is keeping the pole balanced just once, then very
little of the space of weight combinations need be
explored. Thus, as in other systems that learn by example,
in order to generalise the program must see a number of
instances of a concept. However, here the instances are
complete control strategies, in our case, represented by a
set of boxes.

Selfridge, Sutton and Barto (1985) have shown that once the
pole balancing task has been learned for one configuration
of the system, learning algorithms can be created that will
learn more rapidly how to control a new configuration. So
is it still desirable to be able to find a single control
strategy? In a system that does not change too radically it
would obviously be less wasteful not to have to re-learn
each time a new control task arises. More importantly, in
regular systems such as pole balancing (Makarovic, 1987)
and satellite control (Sammut and Michie, 1989), it can be
shown that very simple control rules can be effective. We
wish to explore the possibility of learning these rules.

In  the  following  section  we  give  one  method  for  de¬ 
riving  a  robust  control  strategy. 

3. Experiment 1: Voting

The apparent cause of the learning algorithm’s “over
specialisation” is that it converges too quickly on one
strategy for balancing the pole and therefore does not try
to explore more of the problem space. Our solution to this
problem requires that we derive a number of these
specialised control strategies and produce a generalisation
of them. The method can be viewed as running a number of
independent versions of the learning algorithm and
averaging the outputs (cf. Buntine, 1989).

To test the reliability of a control strategy, we freeze
the boxes and then use them to try to control the pole and
cart from a new random starting position. This is repeated
20 times. The starting range is restricted to regions of
the problem space from which it is possible to avoid
failure, e.g. we do not start the pole leaning so far over
that it is impossible to swing it back towards vertical. If
the pole can be balanced for 10,000 time steps in 20 out of
20 different runs then we deem the control strategy to be
reliable. After one learning run whose success criterion is
10,000 time steps we usually get 0/20 performance on the
reliability test. Even when the criterion is set at 100,000
we do little better. Thus, a single long learning run still
does not guarantee good search of the state space. Details
of all of the experiments described in this paper are given
in Cribb (1989).
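
The reliability test can be written as a short loop. The sketch below assumes two hypothetical callables that are not defined in the paper: `simulate(boxes, state, max_steps)`, which runs the frozen controller and returns the number of time steps survived, and `random_start(rng)`, which draws a recoverable initial state.

```python
import random


def is_reliable(boxes, simulate, random_start,
                trials=20, max_steps=10_000, rng=None):
    """Return True if the frozen boxes balance the pole in every trial."""
    rng = rng or random.Random()
    for _ in range(trials):
        state = random_start(rng)  # random but recoverable starting state
        if simulate(boxes, state, max_steps) < max_steps:
            return False           # a single failure makes the strategy unreliable
    return True
```

Requiring 20 successes out of 20 matches the criterion stated above; a softer score, such as the mean survival time, could be returned instead if a graded measure is wanted.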

To collect information from a number of successful, but
specific, control strategies we require each box to
maintain a count of “votes” for each possible control
action and, at the end of a successful trial, the box casts
a vote for the action to which it is currently set. Thus,
after learning to balance the pole we scan each of the 162
boxes and if a box has an action setting of ‘0’ then the
‘0’ vote is incremented by 1 and similarly boxes that have
an action setting of ‘1’ have the ‘1’ vote incremented. The
program then starts the next trial as usual. In one
experiment, after 832 trials, 100 trials had been
successful, thus giving 100 rounds of voting. The majority
vote from each box was extracted and used to create a new
set of boxes that could be tested for reliability. That is,
each box in the new set of boxes was obtained by comparing
the ‘0’ and ‘1’ votes and if ‘0’ had more votes then the
box in the new set was set to ‘0’ and similarly for boxes
where ‘1’ won the “election”. Testing the new set of boxes
resulted in 20/20 successful trials. Thus, voting holds
some promise as a method for generalising control
strategies.
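
The voting bookkeeping is simple enough to sketch directly. The function names are hypothetical, and the handling of a tied vote (resolved here in favour of ‘0’) is an assumption, since the text does not say what happens when the two counts are equal.

```python
def new_votes(num_boxes=162):
    """One [zero_votes, one_votes] pair per box."""
    return [[0, 0] for _ in range(num_boxes)]


def cast_votes(votes, boxes):
    """After a successful trial, each box votes for its current action."""
    for index, action in enumerate(boxes):
        votes[index][action] += 1


def majority_boxes(votes):
    """Build a new set of boxes from the majority vote in each box."""
    return [1 if ones > zeros else 0 for zeros, ones in votes]
```

`cast_votes` would be called once at the end of every successful trial, and `majority_boxes` whenever a frozen controller is needed for the reliability test.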

How many elections are required before we can be sure of
having a robust control strategy, if at all? To examine
this question, we performed the reliability test after each
round of voting. The results are plotted in Figure 3 where
the vertical axis shows the average length of time for
which the pole was balanced in the 20 trials of the
reliability test.

The graph suggests that it is necessary to learn to balance
the pole on about 20 different occasions before we can
construct a reliable control strategy using the BOXES
algorithm. When different learning algorithms are used in
conjunction with voting, the results may vary. In
particular, we found that variants of BOXES that learn to
balance the pole more rapidly can take much longer to
stabilise when used in voting.

4.  Experiment  2:  Coercion 

Now that we have a way of producing a reliable control
strategy, how do we extract simple and readable rules if
they exist? One of the advantages of the boxes
representation is that it provides a simple “map” of the
problem space that can be examined for regularities and we
can use these regularities to simplify the controller. For
example, the set of boxes shown in Figure 1 was obtained by
having a program learn how to balance the pole 32 different
times and then collecting the votes from each attempt.
There are hints of regularity in the patterns of zeros and
ones, e.g. the rightmost column consists entirely of ones
except for the last box. There also appears to be some
symmetry which we would expect in this problem. Perfectly
clear patterns do not emerge because not all boxes are
visited with equal frequency, thus some will have fewer
opportunities to learn than others and will have settings
that rely on statistically unsound information. Thus the
set of boxes contains “noise” which we can try to clean up.

It is worth noting that in some regions of the state space
(i.e. in some boxes) it doesn’t really matter which action
is chosen. If the pole is balanced and the cart is near the
middle of the track, or if the system is on the brink of
failure and irrecoverable, then whether the cart is pushed
left or right makes very little difference. The votes for
different control actions in these boxes are usually very
close, giving an indication of their indecision. By
applying a χ² test to the votes, we can identify those
boxes that are indecisive and “coerce” the votes so that
boxes conform with their surroundings. We use a nearest
neighbour averaging method to choose the settings of
indecisive boxes. In this way, we obtain the new set of
boxes shown in Figure 4. It is apparent that the array now
displays a symmetry that would be expected in this
particular task.

Figure 3: Scores after successive “elections”

Figure 4: “Coerced” boxes
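
One way the coercion step could look in code is sketched below. Several details are assumptions rather than facts from the text: the goodness-of-fit statistic against an even split with a 5% critical value, the definition of a neighbour as a box one step away along a single dimension of the 3 x 3 x 6 x 3 grid, and leaving a box unchanged when its decisive neighbours are evenly split.

```python
from itertools import product

SHAPE = (3, 3, 6, 3)    # regions for x, x_dot, theta, theta_dot
CHI2_CRITICAL = 3.841   # 5% critical value, 1 degree of freedom (assumed level)


def all_cells():
    """Every 4-D box coordinate in the grid."""
    return list(product(*(range(size) for size in SHAPE)))


def is_decisive(zeros, ones):
    """Test the votes against an even split between the two actions."""
    n = zeros + ones
    return n > 0 and (zeros - ones) ** 2 / n >= CHI2_CRITICAL


def neighbours(cell):
    """Cells that differ from `cell` by one step along one dimension."""
    for axis, size in enumerate(SHAPE):
        for step in (-1, 1):
            c = cell[axis] + step
            if 0 <= c < size:
                yield cell[:axis] + (c,) + cell[axis + 1:]


def coerce(votes):
    """Map each cell to an action, overriding indecisive cells.

    `votes` maps every cell to a (zero_votes, one_votes) pair.
    """
    settings = {cell: int(o > z) for cell, (z, o) in votes.items()}
    coerced = dict(settings)
    for cell, (z, o) in votes.items():
        if is_decisive(z, o):
            continue
        near = [settings[n] for n in neighbours(cell) if is_decisive(*votes[n])]
        if near and 2 * sum(near) != len(near):
            coerced[cell] = int(2 * sum(near) > len(near))
    return coerced
```

Only indecisive boxes are ever changed, which matches the policy of leaving alone any box whose votes are statistically clear.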

After experimenting with this averaging method for cleaning
up noise we began to realise that there are similarities
between this and image processing. The set of boxes can be
thought of as forming a four-dimensional “retina” that
contains an image of the world. Just as in image processing
we wish to produce clean images and to detect patterns in
the image. At this stage we have not pursued this analogy
further but it appears to be worthwhile to investigate how
other methods from image processing could be borrowed to
assist us.

In  the  next  section,  we  discuss  how  we  can  use  the 
coerced  boxes  to  simplify  the  representation  of  the 
state  space  and  obtain  control  rules  for  the  pole  and 
cart  problem. 

5.  Experiment  3:  Merging 

From  Figure  4  it  is  clear  that  there  is  very  little  dif¬ 
ference  between  the  outer  two  major  columns  and  the 
inner  two.  There  is  also  some  horizontal  symmetry. 
This  observation  leads  us  to  try  to  merge  regions  of  the 
state  space. 

One of the limitations of the BOXES algorithm is that
partitioning the state space must be done by hand, prior to
any learning. If partitioning is too coarse, it may be
impossible to learn the control task since a box may cover
a region of the state space that is too large to class as
requiring only a left push or a right push. If partitioning
is very fine then it is more likely that a successful
control strategy can be learnt but the time required
increases with the number of boxes and the strategy also
becomes difficult to express simply. For example, in our
original set of boxes we effectively have 162 control rules
for a relatively simple task. This number can be reduced
significantly if we merge boxes by eliminating and
adjusting some of the partitions.

As an example of merging, let us examine the two left-most
major columns in Figure 4, i.e. the “pole leans far left”
and “pole leans mid left” columns. They differ in only one
box, which we will call B. Any attempt to coerce this box
renders the whole control strategy unreliable. At first,
this appears to prevent us from merging the two columns,
but it is possible.

The BOXES algorithm treats each box as an independent
learning unit, i.e. individual boxes do not know what any
other box has learnt. However, boxes do learn to operate in
a cooperative manner because the scheme of rewards and
punishments that the algorithm uses tells each box
implicitly how well it is doing in relation to the whole
system. Michie and Chambers (1968) describe this scheme in
detail. In fact box B cannot be changed because it
cooperates with other boxes in the centre columns. So if
one box is to be changed, it may be necessary to change
others as well. Since we do not know which boxes cooperate
with each other, when we re-partition the state space, we
also have to re-learn box settings.

We adopted the policy that only those boxes that did not
pass the χ² test could be coerced. Box B did pass the test,
i.e. the majority of attempts at learning to control the
pole set this box to one. Therefore there is a genuine,
small difference between the control strategies required
for the two different regions. To account for this, when we
merge the regions we shift the adjacent partitions. We
define the merging operation as follows:

1.  Two  partitions  are  candidates  for  merging  if  the 
set  of  boxes  contained  in  them  are  similar  to  a 
given  degree,  say  95%. 

2.  We  can  measure  their  degree  of  similarity  by 
taking  each  box  in  turn  in  one  of  the  regions, 
comparing  its  setting  with  that  of  the  correspon¬ 
ding  box  in  the  other  region  and  keeping  a  tally 
of  the  proportion  of  matches  to  mismatches. 

3. If the tallies indicate that the two regions are
sufficiently  similar,  we  can  eliminate  the  thresh¬ 
old  that  separates  them  and  reset  the  adjacent 
thresholds,  making  their  new  values  mid-way 
between  their  old  values  and  the  value  of  the 
eliminated  threshold. 
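
The merging operation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each region is represented as a list of box settings in corresponding order, similarity is taken as the fraction of matching boxes (the text speaks of a tally of matches to mismatches), and the thresholds along one dimension are held in a sorted list. The names `similarity` and `merge_thresholds` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def similarity(region_a: Sequence[int], region_b: Sequence[int]) -> float:
    """Fraction of corresponding boxes whose settings agree."""
    if not region_a or len(region_a) != len(region_b):
        raise ValueError("regions must be non-empty and of equal size")
    matches = sum(1 for a, b in zip(region_a, region_b) if a == b)
    return matches / len(region_a)


def merge_thresholds(thresholds: List[float], k: int) -> List[float]:
    """Eliminate thresholds[k] and move each adjacent threshold to the
    mid-point between its old value and the eliminated value."""
    removed = thresholds[k]
    result = list(thresholds)
    if k > 0:
        result[k - 1] = (thresholds[k - 1] + removed) / 2
    if k + 1 < len(thresholds):
        result[k + 1] = (thresholds[k + 1] + removed) / 2
    del result[k]
    return result
```

With a 95% criterion, two regions would be merged only when `similarity(...) >= 0.95`; the new partition would then be re-learnt, as the text requires.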


Having  obtained  the  coerced  set  of  162 boxes  in  Fig¬ 
ure  4  we  proceeded  to  simplify  the  partitions.  This  can 
be  done  in  a  conservative  way: 

1.  Merge  only  those  partitions  with  the  highest  de¬ 
gree  of  similarity. 

2.  Learn  how  to  control  the  system  using  the  new 
partition. 

3.  Collect  votes  from  at  least  20  learning  runs. 

4.  Coerce  the  indecisive  boxes. 

5.  Repeat  the  whole  process. 

Steps 1 and 4 may have several candidates, not all of
which are reliable according to the criterion described
earlier. Therefore one candidate is selected by running
the reliability test.

By iterating through this procedure we obtained a
set of 108 boxes, then 72 and finally the 54 boxes shown
in Figure 5.

Figure 5: Simplified set of boxes after merging

The representation is now simple enough that we
can read off a set of rules in the form of an if-then-else
statement.

This can be restated simply: “If the pole is swinging
rapidly or leaning well over in one direction then push
in that direction. If the pole is balanced but the cart is
over to one side then push toward that side”. The second
part of the rule is interesting because it seems
counter-intuitive at first. However, in order to move the
pole and cart toward the middle of the track, the pole should
first be leaning in the direction in which we want to
go, so the first step is to move in the opposite direction
to swing the pole in the correct direction.

At this point we should also note a further analogy
with region finding in image processing. Region finding
algorithms attempt to either merge or split neighbouring
regions in an attempt to find a meaningful structure
in the image. In our case we are merging regions only.
However, in discussing future work Michie and
Chambers (1968) suggested that automatically merging
and splitting boxes was desirable. Our work goes
some way toward this goal but further work is required,
particularly to obtain a data structure for boxes that
makes splitting and merging simple.

6.  Experimental  Variations  and  Future  Ex¬ 
periments 

Cribb  (1989)  has  performed  an  extensive  array  of  ex¬ 
periments  other  than  those  we  are  able  to  report  here. 
These  include  variations  on  the  original  BOXES  algo¬ 
rithm  and  different  methods  for  voting  and  coercion. 
Some  of  these  experiments  are  summarised  below  and 
we  also  discuss  further  experiments  that  we  intend  to 
carry  out. 

In  Figure  3  we  showed  how  combining  results  from 
independent  learning  sessions  allowed  us  to  produce 
more  reliable  control  strategies.  This  graph  was  de¬ 
rived  from  experiments  where  sets  of  boxes  were  com¬ 
bined  and  later  coerced.  We  also  attempted  to  coerce 
boxes  “on  the  fly”  and  use  the  coerced  boxes  in  voting. 
This  method  did  not  produce  a  reliable  control  strategy 
as  consistently  as  the  original  method.  It  appears  that 
coercion  is  only  worth  doing  on  sets  of  boxes  that  have 
already  been  found  to  be  reliable. 

The  experiments  described  in  this  paper  have  been 
repeated  on  a  pole  and  cart  system  where  the  forces 
applied  to  the  cart  were  asymmetrical.  That  is,  the  force 
applied  when  pushing  to  the  left  was  10  Newtons  while 
the  force  used  to  push  to  the  right  was  only  5  Newtons. 
This  is  an  inherently  less  stable  system  and  therefore 
takes longer to learn to control. The asymmetrical nature
of the problem is also paralleled by asymmetry in
the boxes.

We mentioned earlier that the BOXES algorithm
has a system of rewards and punishments that implicitly
forces boxes to cooperate with each other. When the
problem is asymmetrical it appears that this global
knowledge is less useful. A variation of the learning
algorithm that only used local knowledge of boxes on
average converged more quickly than the original
algorithm. However, when this algorithm was used in
conjunction with voting, many more elections were
required before we could guarantee that a reliable set
of boxes would result. Thus we need to investigate
much more thoroughly the conditions under which the
voting method can be used with confidence. This
experiment also supported our view that rapid
convergence is not always consistent with generality.

To  properly  validate  our  approach  it  will  be  necess¬ 
ary  to  repeat  these  experiments  using  other  control 
tasks.  We  have  begun  preliminary  work  on  learning  to 
control  a  double  pole,  that  is,  one  pole  is  balanced  on 
the  end  of  another  as  shown  in  Figure  6.  This  gives  us  a 
6-dimensional state space. This task is useful because
with six dimensions it is impossible to visually inspect
boxes for regularities, which eliminates any possibility
of bias on our part when writing our algorithms.


We have also used a satellite control problem in
some  experiments.  Sammut  and  Michie  (1989)  give  a 
full  account  of  this  work.  We  adapted  the  pole  balanc¬ 
ing  rules  above  so  that  they  could  control  the  thrusters 
of  a  satellite,  the  goal  being  to  maintain  a  stable  atti¬ 
tude  for  the  satellite.  This  domain  is  somewhat  more 
complicated  than  pole  balancing  because  there  are 
many  more  actions.  In  the  case  of  the  pole  balancer 
there  are  only  two  possible  actions:  push  left  or  push 
right.  To  control  a  satellite  we  can  apply  positive  or 
negative  torques  for  either  roll,  pitch  or  yaw.  Thus 
there is a minimum of six actions. Furthermore,
bang-bang control is not acceptable since propellant usage is
critical for the longevity of the spacecraft. As the
number of control actions increases, the number of
experiments required to find appropriate box settings
increases dramatically.

Sammut and Michie (1989) used the experience
gained from learning pole balancing control strategies
to manually construct a controller for the simulation of
a commercial satellite. The control rules are best
expressed as a decision table, which happens to
correspond closely to the boxes representation. The
strongest similarity between pole balancing and
satellite control is that we compare the rate of change and
value of each variable with a threshold. One of the
differences is that each of the variables roll, pitch and yaw
can be treated independently. Thus each variable has
its own decision table. Table 1 shows the control
strategy for yaw only. To illustrate how the table works,
when the yaw is positive but the yaw rate is very
negative then the satellite fires its yaw thruster using one
quarter of full thrust. Note that we are no longer using
bang-bang control. Instead we use discrete steps in
thrust: no thrust (0), one quarter of full thrust (1/4), one
half (1/2) and full thrust (1). There are three such tables,
one for each of roll, pitch and yaw. At each time interval in
the simulation, all three tables are consulted so that
three actions may be enabled simultaneously. We have
not yet attempted to learn control strategies in this
domain but intend to do so, since the additional
computational demands brought about by the complexity of the
task require us to find more efficient methods of
learning.
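
A decision table of this kind can be sketched as a dictionary lookup. The sketch below is an illustration only: the band names, the numeric thresholds passed to `band`, and every table entry except the one quoted in the text (positive yaw with a very negative yaw rate gives one quarter thrust) are placeholders and are not taken from Table 1. Thrust direction is not modelled.

```python
from fractions import Fraction

# Thrust steps named in the text: 0, 1/4, 1/2 and full thrust.
QUARTER, HALF, FULL = Fraction(1, 4), Fraction(1, 2), Fraction(1)


def band(value, small, large):
    """Classify a reading by comparing it with two thresholds
    (both threshold values are hypothetical)."""
    magnitude = abs(value)
    if magnitude < small:
        return "zero"
    sign = "positive" if value > 0 else "negative"
    return "very_" + sign if magnitude >= large else sign


# (yaw band, yaw-rate band) -> fraction of full thrust.
# Only the first entry is stated in the text; the rest are placeholders.
YAW_TABLE = {
    ("positive", "very_negative"): QUARTER,
    ("positive", "positive"): HALF,        # placeholder
    ("very_positive", "positive"): FULL,   # placeholder
}


def yaw_thrust(yaw, yaw_rate):
    """Look up the thrust fraction for the current yaw and yaw rate."""
    key = (band(yaw, 0.01, 0.1), band(yaw_rate, 0.001, 0.01))
    # Assumption: cells not listed above mean "no thrust".
    return YAW_TABLE.get(key, Fraction(0))
```

Because roll, pitch and yaw are treated independently, a full controller under these assumptions would hold three such tables and consult all of them at each time step.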


Table 1: Yaw Decision Table

7. Discussion

One  of  the  most  exciting  results  of  this  work  has 
been  the  convergence  of  the  experimentally  derived 
rules  shown  above  and  rules  that  Makarovic  (1987)  ob¬ 
tained  from  a  mathematical  analysis  of  the  equations  of 
motion  of  the  pole  cart.  Except  for  small  differences  in 
threshold  values,  the  rules  are  the  same.  Moreover,  we 
were  able  to  transfer  much  of  what  we  had  learnt  from 
pole  balancing  to  other  control  tasks  (Sammut  and 
Michie,  1989).  One  important  implication  of  these  re¬ 
sults  is  that,  with  further  work,  it  may  be  possible  to  use 
these  methods  to  learn  the  behaviour  of  ‘black  boxes’ 
and  then  by  analysing  the  behaviour,  build  a  more  ab¬ 
stract  theory  of  the  operation  of  the  black  box.  That  is, 
we  see  this  style  of  learning  as  an  important  adjunct  to 
scientific  discovery  programs. 

There is still much that needs to be done in order to
refine  these  methods  into  usable  tools.  We  have  mostly 
worked  “by  the  seats  of  our  pants”,  trying  something 
when  it  looked  right.  This  is  unsatisfactory  and  requires 
a  better  theoretical  understanding  of  the  learning  algo¬ 
rithms  and  further  experimental  work  to  validate  some 
of  our  conjectures. 


The  goal  of  this  work  has  been  to  use  a  reinforce¬ 
ment  learning  algorithm  to  acquire  expertise  in  con¬ 
trolling  a  dynamic  system  and  then  analysing  the  results 
to  produce  simple,  readable  control  rules.  In  the  course 
of this analysis, we found that fast learning algorithms
tend to produce controllers that are very specific to the
task, i.e. they do not generalise well. In some respects
this is similar to the tendency of some learning
algorithms to over-fit data (Quinlan, 1987; Fisher, 1989).

Abstract

This  paper  considers  adaptive  control  archi¬ 
tectures  that  integrate  active  sensory-motor 
systems  with  decision  systems  based  on  rein¬ 
forcement  learning.  One  unavoidable  conse¬ 
quence  of  active  perception  is  that  the  agent’s 
internal  representation  often  confounds  ex¬ 
ternal  world  states.  We  call  this  phenomenon 
perceptual  aliasing  and  show  that  it  desta¬ 
bilizes  existing  reinforcement  learning  algo¬ 
rithms  with  respect  to  the  optimal  decision 
policy.  A  new  decision  system  that  over¬ 
comes  these  difficulties  is  described.  The 
system  incorporates  a  perceptual  subcycle 
within  the  overall  decision  cycle  and  uses  a 
modified  learning  algorithm  to  suppress  the 
effects  of  perceptual  aliasing.  The  result  is  a 
control  architecture  that  learns  not  only  how 
to  solve  a  task  but  also  where  to  focus  its  at¬ 
tention  in  order  to  collect  necessary  sensory 
information. 

1  Introduction 

Recently there has been a resurgence of interest in
intelligent control architectures that are based on
reinforcement learning methods (RLM) [Barto et al., 1989;
Clocksin and Moore, 1988; Holland, 1986; Miller et al.,
1990; Sutton, 1988; Watkins, 1989; Whitehead and
Ballard, 1989a; Wilson, 1987a]. These architectures
are appealing because they are both reactive and
adaptive. Unlike traditional plan-based controllers, RLM
systems do not make decisions by appealing to
time-consuming search through a space of possible plans.
Instead, they maintain a policy function that maps
situations directly into actions. Decision making
reduces to computing the instantaneous value of the
policy function and can be performed in constant time;
e.g. a policy function can be implemented using a
table, CMAC, or neural net (all of which can be
evaluated in constant time).

The immediacy of decision making puts RLM
systems in close relationship with other reactive systems
[Agre and Chapman, 1987; Brooks, 1986; Drummond,
1989; Firby, 1987; Georgeff and Lansky, 1987; Nilsson,
1989; Schoppers, 1987]. However, RLM systems
distinguish themselves from these and most reactive
systems in that they are adaptive. The vast majority of
reactive systems do not learn. Instead, their decision
knowledge is hand-coded into them by their designers,
either explicitly (e.g. [Agre, 1988; Brooks, 1986;
Georgeff and Lansky, 1987; Firby, 1987]) or through
hand-coded causal models which eventually get
compiled into a set of reactive rules (e.g. [Blythe and
Mitchell, 1989; Fikes et al., 1972; Laird et al., 1986;
Schoppers, 1989]). RLM systems do not rely on
hand-coded decision knowledge. They learn the optimal
control strategy by interacting with the world and
receiving feedback in the form of reinforcement. This
adaptability relieves the burden of providing complete
domain knowledge a priori, since it is acquired with
experience. It also allows the system to adapt to changing
circumstances and learn new tasks.

Although RLM systems are promising, to date they
have only been applied to relatively simple tasks,
such as pole balancing [Barto et al., 1983; Sutton,
1984], simplified navigation [Barto and Sutton, 1981;
Booker, 1982; Sutton, 1990; Watkins, 1989; Wilson,
1987a], and easy manipulation games [Anderson, 1989;
Whitehead and Ballard, 1989a]. Before these systems
can be scaled to larger, more complex control
problems a number of issues must be addressed. These
include developing techniques for improving the
learning rate, developing more sophisticated uses for
internally held goals, and incorporating more realistic
models of perception and action. Early progress on the first
two of these issues looks promising (for faster
learning see [Sutton, 1990; Whitehead and Ballard, 1989a;
Whitehead and Ballard, 1989b]; for models using
internal goals see [Watkins, 1989; Whitehead, 1989;
Wilson, 1987b]). This paper deals with the third
issue, that of integrating realistic sensory-motor systems into
RLM architectures.

The vast majority of work in AI has not dealt
realistically with perception, and research in
reinforcement learning is no exception. A common simplifying
assumption is that a decoupled perceptual system
automatically supplies a central control system with a
database of logically consistent inputs that describes
in detail each object in the domain; that is, a set of
propositions that describe the relationships between,
and the features of, all the objects in the domain.
Unfortunately, even for simple toy domains this leads to
large, complex internal representations and unrealistic
assumptions about the capabilities of the perceptual
system. For example, in a classical blocks world
domain containing n blocks, the size of the state space
using a traditional representation is O(n!). For n = 19
that is over two billion (2,147,483,647) states. Most of
the information that distinguishes states in the
internal representation is irrelevant to the immediate task faced
by the agent and only interferes with decision making
(and learning) by clogging the system with irrelevant
details. Further, these overly descriptive
representations put undue pressure on the perceptual system to
maintain their fidelity.


Agre and Chapman have recognized this problem
and suggested indexical representations, a much more
feasible approach based on active sensory-motor
systems [Agre, 1988; Agre and Chapman, 1987;
Chapman, 1989]. A central premise underlying indexical
representations  is  that  a  system  needn’t  name  and  de¬ 
scribe  every  object  in  the  domain,  but  instead  should 
only  register  information  about  objects  that  are  rele¬ 
vant  to  the  task  at  hand.  Further,  those  objects  should 
be  indexed  according  to  the  function  they  play  in  the 
current  behavior.  Two  important  implications  of  this 
approach  are:  1)  it  leads  to  compact  and  limited  scope 
input  representations  since  at  any  moment  the  sensory 
system  registers  only  the  features  of  a  few  key  objects; 
and  2)  it  leads  to  systems  that  actively  control  their 
perceptual  apparatus  —  i.e.  actively  manipulate  the 
binding  between  objects  in  the  world  and  internal  rep¬ 
resentational  structures.  Thus,  in  the  case  of  a  blocks 
world  problem,  the  system  would  represent  blocks  im¬ 
mediately  relevant  to  the  task,  and  would  be  oblivious 
to  the  rest.  As  a  result,  the  size  of  the  internal  rep¬ 
resentation  reflects  the  complexity  of  the  task  being 
solved  and  not  the  number  of  objects  in  the  domain 
(which  could  be  arbitrary!). 


We  have  been  experimenting  with  systems  that  in¬ 
corporate  both  indexical  representations  (for  feasible 
perception)  and  RLMs  (for  adaptive  control).  In  par¬ 
ticular,  we  show  that  integrating  indexical  represen¬ 
tations  (and  active  perception,  in  general)  and  RLM 
into  a  single  control  architecture  is  non-trivial  because 
the  use  of  indexical  representations  results  in  internal 


of the external world. We term this phenomenon
perceptual aliasing and show that it leads to undesired
local maxima in the decision system’s evaluation
function. These local maxima severely interfere with the
decision system’s ability to learn an adequate control
policy. To overcome these difficulties a new RLM
decision system is introduced that embeds a perceptual
cycle within the overall decision cycle and uses a
modified learning algorithm to eliminate the undesired
local maxima. The result is a reactive architecture that
learns both the overt physical action needed to solve
a problem, and where to focus its attention in order
to disambiguate the current situation with respect to
the task. These ideas are illustrated in a system that
learns a simple block manipulation task.


2  Foundations 

2.1  Embedded  Learning  Systems 

Before getting into the details of indexical
representations, reinforcement learning, and perceptual
ambiguity, it is useful to formalize concepts such as “the
world”, “the agent”, “the sensory-motor system” and
“the decision system”. For this purpose we begin
by adopting a formal model for describing embedded
learning systems. The model is shown in Figure 1,
and extends a model proposed by Kaelbling [Kaelbling,
1989] by representing the relationship between
external world states and the agent’s internal representation.

The world is modeled as a deterministic automaton
whose state changes depend on the actions of an
agent. The world is formally described by the triple
$(S_E, A_E, W)$, where $S_E$ is the set of possible world
states, $A_E$ is the set of possible physical actions
executable by the agent, and $W$ is the state transition
function mapping $S_E \times A_E$ into $S_E$. Our model of the
agent is slightly more complex, consisting of three
components: a sensory-motor subsystem, a reward center,
and a decision subsystem.

The sensory-motor subsystem implements three
functions: 1) a perceptual function $\mathcal{P}$, 2) an internal
configuration function $\mathcal{I}$, and 3) a motor function $\mathcal{M}$.
Its main purpose is to relativize the external
transductory signals $S_E$ and $A_E$. On the sensory side the
system transduces the world state into the agent’s
internal representation. Since perception is active, this
mapping is dynamic and depends on the configuration
of the sensory-motor apparatus. Formally, the
relationship between external world states and the agent’s
internal representation is modeled by the perceptual
function $\mathcal{P}$ which maps world states $S_E$ and
sensory-motor configurations $C$ onto internal representations
$S_I$ (i.e. $\mathcal{P} : S_E \times C \to S_I$). On the motor side, the
agent has a set of internal motor commands $A_I$ that
affect the model in two ways: they can either change the
state of the external world (by being translated into
external actions, $A_E$), or they can change the
configuration of the sensory-motor subsystem. As with
perception, the configuration of the sensory-motor
subsystem modulates the effects of internal commands.
This is modeled by the functions $\mathcal{M}$ and $\mathcal{I}$,
which map internal commands and sensory-motor
configurations into actions in the external world and
new sensory-motor configurations respectively. That
is, $\mathcal{M} : A_I \times C \to A_E$ and $\mathcal{I} : A_I \times C \to C$.
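
The four mappings described so far (the world transition function, the perceptual function, the internal configuration function and the motor function) can be made concrete with a toy example. Everything specific in the sketch below is an assumption made for illustration: a three-block world described only by block colours, a configuration that is simply the index of the attended block, the command names `look-left`, `look-right` and `paint-red`, and the convention that perceptual commands produce no external action.

```python
from typing import Optional, Tuple

WorldState = Tuple[str, ...]       # colour of each block, left to right
Config = int                       # index of the block being attended to
ExternalAction = Tuple[str, int]   # (verb, block index)

NUM_BLOCKS = 3  # toy world size, chosen only for this sketch


def perceive(state: WorldState, config: Config) -> str:
    """Perceptual function, S_E x C -> S_I: only the attended block is seen."""
    return state[config]


def reconfigure(command: str, config: Config) -> Config:
    """Configuration function, A_I x C -> C: shift the focus of attention."""
    if command == "look-right":
        return min(config + 1, NUM_BLOCKS - 1)
    if command == "look-left":
        return max(config - 1, 0)
    return config


def motor(command: str, config: Config) -> Optional[ExternalAction]:
    """Motor function, A_I x C -> A_E: overt commands act on the attended block."""
    if command == "paint-red":
        return ("paint-red", config)
    return None  # assumption: perceptual commands yield no external action


def world(state: WorldState, action: Optional[ExternalAction]) -> WorldState:
    """Transition function W, S_E x A_E -> S_E (deterministic)."""
    if action is None:
        return state
    _, index = action
    return state[:index] + ("red",) + state[index + 1:]


def step(state: WorldState, config: Config, command: str):
    """One cycle: apply the command, then report the new internal state."""
    new_state = world(state, motor(command, config))
    new_config = reconfigure(command, config)
    return new_state, new_config, perceive(new_state, new_config)
```

In this toy world the states `("red", "green", "blue")` and `("red", "blue", "blue")` produce the same internal state while the first block is attended to, and only become distinguishable after a `look-right` command. This is the kind of many-to-one mapping the paper calls perceptual aliasing.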

The second subsystem in our model is the reward
center. It implements a reward function $\mathcal{R}$, which
maps external world states $S_E$ into real valued rewards
$\Re$. Rewards indicate that the world is in a desirable state
and are used by the decision subsystem to improve
performance.

Figure 1: A formal model for embedded learning
systems with active sensory-motor subsystems.


The final component of the agent is the decision
subsystem. This subsystem is like a homunculus that sits
inside the agent’s head and controls its actions. On
the sensory side, the decision subsystem does not have
access to the state of the external world, but only the
agent’s internal representation. Similarly on the motor
side, the decision subsystem generates internal action
commands that are interpreted by the sensory-motor
system. Formally, the decision subsystem implements
a behavior function $B$ mapping sequences of internal
states and rewards, $(S_I \times \Re)^*$, into internal actions, $A_I$.
The objective of the subsystem is to implement a
control policy that maximizes its return, which is defined
as a discounted sum of the reward received over time:

$$\text{return} = \sum_{n=0}^{\infty} \gamma^n r_{t+n} \qquad (1)$$

where $r_t$ is the reward received at time $t$, and $\gamma$ is a
discount factor between 0 and 1.
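
As a small worked illustration of the discounted return, the sketch below evaluates the sum for a finite list of rewards. The truncation to a finite sequence and the function name `discounted_return` are assumptions made for the example; the definition in the text is an infinite sum.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**n * rewards[n], computed from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For example, with rewards `[1, 1, 1]` and a discount factor of 0.5 the return is 1 + 0.5 + 0.25 = 1.75, so a reward received later contributes less than the same reward received now.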


2.2  Indexical  Representations 

The central concept underlying indexical
representations is the marker, around which nearly all action and
perception revolves. Markers are similar to variables,
implemented by the sensory-motor system; they get
dynamically bound to objects in the world and remain
bound to those objects until being bound to other
objects. Changing a marker’s binding is accomplished by
executing explicit actions specifically targeted for that
marker. For example, a system might have a marker
M1 and an associated action Warp-M1-to-Red, which
is used to index and bind red objects to M1. In this
case, executing Warp-M1-to-Red causes the
sensory-motor system to search the world for a red object and
bind it to M1. If a red object cannot be found, the
action fails and M1’s binding remains unchanged. If
multiple red objects exist, the sensory-motor system
chooses the first one it comes to. Markers play an
important role in motor control since overt actions are
predominately specified with respect to markers. In
this case, a marker’s binding acts to establish the
reference frame in which the action is performed. For
example, the overt action Place-at-M1 might cause a
robot to place an object it is holding at the location
currently associated with marker M1. We distinguish
two types of markers, overt markers and perceptual
markers. A marker is overt if it has an action
associated with it that affects the state of the external world.
Otherwise it is a perceptual marker. Overt markers
are used for establishing reference frames for action in
the world, while perceptual markers are used for
collecting additional information about the current state.
Actions associated with overt markers are called overt
actions and actions associated with perceptual
markers are called perceptual actions.
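
The binding behaviour of a marker can be sketched as follows. The classes `Block` and `Marker` and the function `warp_to_color` are hypothetical names introduced for this illustration, and the order of the list stands in for whatever order the sensory-motor system searches the world in.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    name: str
    color: str


@dataclass
class Marker:
    name: str
    bound_to: Optional[Block] = None


def warp_to_color(marker: Marker, world: List[Block], color: str) -> bool:
    """Bind the marker to the first block of the given colour.

    Returns False, leaving the existing binding unchanged, when no such
    block exists."""
    for block in world:
        if block.color == color:
            marker.bound_to = block
            return True
    return False
```

An action such as Warp-M1-to-Red then corresponds to `warp_to_color(m1, world, "red")`: it rebinds the marker when a red block is found and otherwise leaves the binding as it was.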

A key feature of markers is the constraint that there
are only a limited number of them, say fewer than ten
(the system described below has two markers). The
small number of markers and the limited number of
features associated with each marker keep both the
internal representation and the number of possible
actions much smaller than is possible with conventional
representations. If an object in the world is not bound
to  a  marker,  then  it  is  invisible  to  the  system  (except 
for  the  effects  it  registers  in  peripheral  inputs).  * 

In an indexical representation the system’s sensory
inputs fall into three general categories: peripheral
aspects, local aspects, and relational aspects. Peripheral
aspects register general, spatially non-specific
information about the world, such as the presence or absence
of  certain  colors,  shapes,  and  motions.  Both  local 
aspects  and  relational  aspects  register  properties  of 
marked  objects.  Local  aspects  register  intrinsic,  lo¬ 
cal  features  of  a  marked  object,  such  as  its  shape, 
color,  and  texture.  Relational  aspects  register  rela¬ 
tional  properties  between  marked  objects,  such  as  their 
relative  shape,  relative  color,  and  relative  position. 

As  an  example,  Figure  2  lists  the  specifications  for 
the  indexical  sensory-motor  system  used  by  a  program 
(to  be  described  later)  that  learns  to  solve  a  simple 
block  manipulation  task.  The  system  has  two  mark¬ 
ers,  the  action  frame  and  the  attention  frame.  The 
action  frame  is  used  for  both  perception  and  action, 
while  the  attention  frame  is  used  only  for  perception. 
Each marker has a set of local aspects associated with
it; these are the color and shape of the marked
object, the height of the stack the marked object
belongs to, whether or not the marked object is sitting
on the table, and whether or not the marked object
is held by the robot. The system has two relational
aspects: one for recording vertical alignment between
the two markers and one for recording horizontal
alignment. Peripheral aspects include inputs for detecting
the presence of red, green, and blue objects in the
scene, and for detecting whether the hand is currently
gripping an object.

* Agre and Chapman do not distinguish markers as overt
or perceptual in their systems. The two classes were
introduced because the learning algorithm described below
depends on distinguishing between actions that change the
world and actions that simply change the perception of the
world.

Figure 2: The specification for an indexical
sensory-motor system containing two markers. The system has
a 20 bit input vector, 8 overt actions, and 6 perceptual
actions. The values registered in the input vector and
the effects of internal action commands depend on the
binding between markers in the sensory-motor system
and objects in the external world.

The internal motor commands for the system are
shown in Figure 2 on the right. The overt actions
are made with respect to the action frame. The
two primary overt actions are for grasping and
placing objects. For grasping, grasp-object-at-action-frame
causes the system to pick up the object marked by the
action frame. The action works assuming the hand is
empty and the marked object has a clear top.² Similarly
for placing, place-object-at-action-frame causes
the system to place an object it is holding at the
location of the action frame. This action also works
assuming the hand is holding an object and the target object
has a clear top. Other overt actions include actions for
moving the action frame. Although these may appear
to be perceptual actions, they are overt actions in the
strictest sense since they affect the robot’s ability to
perform other overt actions.

² This isn’t strictly necessary since, unlike
traditional planning systems (e.g. STRIPS [Fikes and Nilsson,
1971]), the agent doesn’t rely on any sort of internal model
to predict the effects of actions but simply watches what
happens in the world.

The  attention  frame  is  a  perceptual  marker  and  has 
a  repertoire  of  perceptual  actions  that  are  used  exclu¬ 
sively  for  gathering  additional  sensory  information.  As 
will  be  seen  in  Section  4,  the  attention  frame  plays  an 
important  role  in  allowing  the  agent  to  disambiguate 
world  states. 

All told, the sensory-motor system has a 20 bit input
vector (4 bits of peripheral aspects, 14 bits of local
aspects, 2 bits of relational aspects) and 14 actions (8
overt, 6 perceptual).

2.3  Reinforcement  Learning 

The task faced by the decision subsystem is a classic
decision problem: given a description of the current
state, a set of possible actions, and previous
experience, choose the best next action. There are a number
of approaches to this problem, but in this paper we
are interested in decision systems that are based on
reinforcement learning methods (RLM). In our
experiments we have focussed on a representative
reinforcement learning method known as Q-learning [Watkins,
1989]; however, our results regarding the interactions
between RLMs and active perception apply to
virtually all RLMs [Barto et al., 1989].

A decision system based on Q-learning maintains
two inter-dependent functions: an action-value function
$Q$ and a policy function $\pi$. The action-value function
$Q$ represents the system’s estimate of the relative
merit of making a given decision. That is, $Q(x, a)$ is
the return the system expects to receive given that
it executes action $a$ in state $x$ and follows its regular
policy ($\pi$) thereafter. The policy function $\pi$ is the
system’s current estimate of the optimal decision policy.
This function maps internal states ($x \in S_I$) into
action commands ($a \in A_I$) and is defined in terms of
the action-value function as follows:

$$\pi(x) = a \ \text{such that}\ Q(x, a) = \max_{b \in A_I} Q(x, b) \qquad (2)$$

That is, for a given state $x$, the policy function simply
selects the action $a$ that (according to $Q$) maximizes
the expected return.

Initially, the action-value function may be erroneous.
However, a procedure for incrementally improving $Q$
can be obtained by recognizing that an accurate
action-value function satisfies the following relation:

$$Q(x_t, a) = E[R(X_{t+1}) + \gamma U(X_{t+1})] \qquad (3)$$

where $E[\cdot]$ denotes the expectation, $X_{t+1}$ is the random
variable denoting the state at time $t+1$, $R(x)$ is the
reward the system receives upon entering state $x$, and
$U(x)$ is called the evaluation function and is defined
as

$$U(x) = \max_{b \in A_I} Q(x, b) \qquad (4)$$

Decision Cycle for 1-step Q-learning:

1) Generate a random number p between 0.0 and 1.0.

2) if (p < 0.9)
   then action ← π(x), where π is the policy function and x is the current state
   else action ← F(A_I), where F() is a random selection function.

3) Execute action; let x′ be the resulting state and r the reward received.

4) Compute the 1-step error:
   error ← r + γU(x′) − Q(x, action)

5) Update the action value of the selected decision:
   Q(x, action) ← Q(x, action) + α · error

6) Update the decision policy (for state x):
   π(x) ← a such that Q(x, a) = max_{b ∈ A_I} Q(x, b)

7) Update the evaluation function (for state x):
   U(x) ← Q(x, π(x))

8) Update the current state: x ← x′

9) Goto 1

Figure 3: The steps in the decision cycle of a decision
system based on 1-step Q-learning.

Based on these equations, the following rule can be
used for updating Q:³

$$Q_{t+1}(x_t, a) = Q_t(x_t, a) + \alpha\,(r_{t+1} + \gamma U_t(x_{t+1}) - Q_t(x_t, a)) \qquad (5)$$

where $Q_t$ is the action-value function at time $t$, $r_{t+1}$
is the reward received at time $t+1$, and $\alpha$ is a constant
that affects the learning rate. This updating rule
is known as the 1-step Q-learning rule, and the term
$r_{t+1} + \gamma U_t(x_{t+1})$ is called the corrected 1-step truncated
return. Watkins has shown that under the appropriate
conditions, a decision system using the 1-step
Q-learning rule is guaranteed eventually to learn the
optimal decision policy [Watkins, 1989].

Figure 3 outlines the control procedure for one
simple decision subsystem implementing 1-step
Q-learning. The first step in the procedure is to select
the next action. 90% of the time the system selects the
action specified by its control policy $\pi(x)$; the remaining
10% of the time it chooses an action at random.
The action is then executed and the subsequent state
and reward are noted. Once the effects of the action
are known, the error in the action-value function for
the current decision can be computed and used to update
$Q$. Finally, $\pi(x)$ and $U(x)$ are updated to reflect
changes in $Q$. The reason the decision system doesn’t
always select the action specified by its policy is that
the action-value of a decision is only updated when
that decision is executed. Occasionally selecting a random
action ensures that each decision will be evaluated
periodically. In our experiments we implemented a
similar but slightly more complex learning procedure
that uses a weighted sum of n-step error terms and is
based on Sutton’s TD methods. (See [Sutton, 1988,
Watkins, 1989] for details.)
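
The decision cycle of Figure 3 can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the five-state corridor environment, the `step(x, a)` interface, the reward of 1.0, and the settings alpha = 0.5 and gamma = 0.9 are all hypothetical choices made for the example. Only the ordering of the steps follows the figure.

```python
import random


def corridor_step(x, a):
    """Hypothetical 5-state corridor used only for this illustration.

    Action 0 stays put, action 1 moves one state to the right.
    Entering state 4 (the goal) yields a reward of 1.0 and ends the trial.
    """
    x_next = min(x + 1, 4) if a == 1 else x
    reward = 1.0 if x_next == 4 else 0.0
    return x_next, reward, x_next == 4


def q_learning(step, n_states, n_actions, start=0, trials=200,
               max_steps=100, alpha=0.5, gamma=0.9, seed=0):
    """1-step Q-learning following the decision cycle of Figure 3."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    U = [0.0] * n_states      # evaluation function U(x)
    pi = [0] * n_states       # policy pi(x)
    for _ in range(trials):
        x = start
        for _ in range(max_steps):
            # Steps 1-2: follow the policy 90% of the time, else act at random.
            p = rng.random()
            action = pi[x] if p < 0.9 else rng.randrange(n_actions)
            # Step 3: execute the action and observe the outcome.
            x_next, r, done = step(x, action)
            # Step 4: 1-step error (a finished trial has no future return).
            future = 0.0 if done else gamma * U[x_next]
            error = r + future - Q[x][action]
            # Step 5: move the action-value toward the 1-step return.
            Q[x][action] += alpha * error
            # Steps 6-7: refresh the policy and the evaluation for state x.
            pi[x] = max(range(n_actions), key=lambda b: Q[x][b])
            U[x] = Q[x][pi[x]]
            # Step 8: advance, or stop once the goal has been entered.
            if done:
                break
            x = x_next
    return Q, U, pi


Q, U, pi = q_learning(corridor_step, n_states=5, n_actions=2)
```

In this toy setting the agent at first reaches the goal only through random actions; once it does, the evaluations should grow toward the goal by a factor of 1/gamma per state.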

3  Perceptual  Aliasing 

The straightforward integration of indexical representations
and RLM decision systems leads to undesirable
interactions that prevent the decision system from
learning an optimal control strategy. These interactions
arise because the mapping between world states
and internal states is many to many. That is, a state
$s_E \in S_E$ in the world, depending on the configuration
of the sensory system, may map to several internal
states; and conversely, a single internal state, $s_I \in S_I$,
can represent several world states. We call this overlapping
between world and internal state spaces perceptual
aliasing and say an internal state is perceptually
ambiguous if it can represent multiple world
states. Figure 4 illustrates perceptual aliasing in a simple
block world domain in which we adopt the sensory-motor
system defined in Figure 2. Figure 4a shows two
different world states (top) that generate the same internal
representation simply because the markers are
focussed on the parts of the states that are similar. In the
figure the (-f ) represents the action-frame marker and
the (+) represents the attention-frame marker. Similarly,
Figure 4b shows that a single world state has
multiple internal representations, each corresponding
to a different placement of the attention-frame marker.

³See [Barto et al., 1989] or [Watkins, 1989] for a detailed
derivation.

Figure 4: Generally the mapping between external
world states and the agent’s internal representation
is many to many. a) shows how two different external
states can generate the same internal representations
and b) shows how one external state may have more
than one internal representation.
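
The many-to-many mapping can be made concrete with a toy sensor. The sketch below is a deliberately coarse stand-in for the 20-bit indexical representation: the function `internal_state`, the encoding of a world state as a tuple of stacks, and the assumption of one block per colour are all hypothetical and serve only to show how marker placement determines what the agent can distinguish.

```python
def internal_state(world, attended):
    """Return what a hypothetical, very coarse sensor reports.

    `world` is a tuple of stacks, each listed bottom-to-top; `attended`
    is the colour of the block under the attention frame. The sensor
    reports only that colour and whether the block's top is clear.
    """
    for stack in world:
        if attended in stack:
            top_is_clear = stack[-1] == attended
            return (attended, top_is_clear)
    raise ValueError("attended block is not in this world state")


# Two different world states (stacks listed bottom-to-top).
world_a = (("green", "red"), ("blue",))       # red sits on green
world_b = (("green",), ("blue",), ("red",))   # every block on the table

# Many-to-one: attending to blue, the two worlds look identical.
aliased = internal_state(world_a, "blue") == internal_state(world_b, "blue")

# One-to-many: one world yields a different internal state per placement.
views_of_a = {internal_state(world_a, c) for c in ("red", "green", "blue")}

# Attending to green separates the two worlds again.
resolved = internal_state(world_a, "green") != internal_state(world_b, "green")
```

Here the internal state obtained by attending to the blue block is perceptually ambiguous, while the one obtained by attending to the green block is not.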

Perceptual  aliasing  has  a  devastating  impact  on  the 
decision  system’s  ability  to  learn  an  adequate  con¬ 
trol  policy  because  it  causes  the  system  to  confound 
world  states  that  it  must  distinguish  in  order  to  solve 
the  task.  The  easiest  way  to  illustrate  the  problem 
is  to  consider  tasks  in  which  the  decision  system  re¬ 
ceives  a  fixed  reward  upon  achieving  a  particular  goal 


184  Whitehead  and  Ballard 


state $g \in S_E$, and otherwise receives no reward. In
this case, it can be shown that the state evaluations,
$U(s)$, and the action-values, $Q(s, a)$, of the steps in
the optimal policy monotonically increase as the system
approaches the goal state. Essentially, the optimal
policy corresponds to a gradient ascent of the
evaluation function. The upper graph in Figure 5
illustrates this result schematically for the optimal
sequence $s_1, s_2, s_3, s_4, s_5, g$ for getting from $s_1$ to $g$.
Each node in the graph represents a world state and
each arc represents a possible decision. Solid arcs represent
optimal decisions and dotted arcs represent
non-optimal decisions. Shown below each state is its
corresponding evaluation, $U(s)$, and below each decision
is its corresponding action-value, $Q(s, a)$.

If  the  decision  system  has  direct  access  to  the  world 
states,  experience  has  shown  that  reinforcement  learn¬ 
ing  methods  can  successfully  learn  an  adequate  if  not 
optimal decision policy.⁴ Unfortunately, the decision
system  does  not  have  access  to  the  state  of  the  ex¬ 
ternal  world,  but  necessarily  accesses  the  world  via 
a  relativized  representation  generated  by  the  sensory 
system.  This  fact  has  been  widely  ignored  in  previ¬ 
ous  work  on  reinforcement  learning.  Existing  systems 
ignore  the  issue  of  perception  completely  or  assume  a 
one  to  one  mapping  between  the  external  state  and 
the  internal  representation.  Such  simplifications  are 
natural  in  toy  domains  but  become  painfully  inade¬ 
quate  when  scaling  to  larger  problems.  If  the  mapping 
between the external world and the internal representation
allows for perceptual aliasing, then the optimal
policy  may  become  unstable  with  respect  to  the  learn¬ 
ing  algorithm.  That  is,  because  of  perceptual  aliasing, 
the  learning  algorithm  will  actually  prevent  the  system 
from  settling  on  the  optimal  policy. 

The lower graph in Figure 5b shows how this instability
arises. Consider the mapping between the
world states and the internal states for the chunk of
the state spaces shown. The mapping is one to one
for all states except external states $s_2$ and $s_5$, which
map to internal state $s_a$. If we fix the decision policy
so that the system follows the optimal policy 90% of
the time and chooses a random action 10% of the time,
and allow the system to estimate its evaluation function,
$U$, and action-value function, $Q$, by running the
system for many trials, we find the following. First,
since the state evaluation and action-value functions
represent expected returns, for $s_a$ they take on values
somewhere between the corresponding values for
$s_2$ and $s_5$. That is, $U(s_2) < U(s_a) < U(s_5)$ and
$Q(s_2, a_r) < Q(s_a, a_r) < Q(s_5, a_r)$, where $a_r$ is the
optimal action associated with $s_a$. Second, the state
evaluations and action values do not monotonically
increase as the system approaches the goal. Instead
a local maximum in the evaluation function arises at
$s_a$. We call this maximum an aberrational maximum
since it doesn’t reflect the true utility of the underlying
decision problem. If we relax our hold on the decision
policy and allow the decision system to adapt, we find
the optimal policy is unstable! Not only can’t the system
find the optimal policy, it actually moves away
from it. In general, the system will oscillate among
policies, never finding a stable one.

⁴For 1-step Q-learning, Watkins has shown that appropriately
designed decision systems can be guaranteed to
converge on the optimal policy. Proofs for other, more
general algorithms do not currently exist but many experimental
systems have been built that learn optimal (or
near optimal) policies [Anderson, 1989, Barto et al., 1983,
Clocksin and Moore, 1988, Watkins, 1989, Whitehead,
1989].

Figure 5: An example of the effect of perceptual aliasing
on the evaluation function. The top graph shows
a state transition diagram for a simple problem. In
this case, the system receives reward only upon entering
state G and the evaluation function monotonically
increases as the distance from the goal decreases. The
bottom graph shows the internal state space for the
problem when the sensory system confounds states S2
and S5. In this case, the evaluation function is no
longer monotonic and the optimal decision policy is
unstable.
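
A small numerical sketch shows how an aberrational maximum can arise. The numbers are illustrative assumptions rather than results from the paper: a discount of 0.9, a goal reward of 1.0, no random actions, and equal visitation of the two aliased world states, so that the shared internal state (called `U_a` below) settles at the mean of their true evaluations.

```python
def chain_evaluations(n_steps=5, gamma=0.9, reward=1.0):
    """True evaluations U(s_1), ..., U(s_n) along the optimal path,
    with `reward` received on entering the goal from s_n."""
    return [reward * gamma ** (n_steps - k) for k in range(1, n_steps + 1)]


def is_increasing(values):
    """True if every value is strictly larger than its predecessor."""
    return all(a < b for a, b in zip(values, values[1:]))


U = chain_evaluations()   # U[0] is U(s_1), ..., U[4] is U(s_5)

# Assumption: s_2 and s_5 are visited equally often, so the estimate for
# the shared internal state lies midway between their evaluations.
U_a = (U[1] + U[4]) / 2

# What the agent sees along the same path once s_2 and s_5 are aliased.
internal_view = [U[0], U_a, U[2], U[3], U_a]

monotone_in_world = is_increasing(U)
monotone_internally = is_increasing(internal_view)
```

Under these assumptions the world-state evaluations rise steadily toward the goal, but the internal view does not: the aliased state is valued above the state that follows it on the optimal path, so a gradient-following policy is drawn back toward it.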

4 Dealing with Perceptual Aliasing

The  main  result  in  this  paper  is  a  decision  system 
based  on  reinforcement  learning  that  can  cope  with 
perceptual  ambiguity.  The  new  algorithm  leads  to 
a  control  architecture  that  not  only  learns  the  cor¬ 
rect  overt  actions  needed  to  solve  a  problem,  but  also 
learns  to  focus  its  attention  on  the  objects  in  the  world 
that  are  relevant  to  the  task.  The  design  is  based  on 
three  observations.  First,  in  active  perception  a  world 
state can be represented by multiple internal states,
one of which is usually unambiguous. That is, in any
given state, if the system looks around enough it will
eventually attend to those objects that are relevant to
the task, and the internal state associated with that
sensory configuration is unambiguous. Our algorithm
depends on the existence of one such unambiguous
internal state for each world state. Second, perceptually
ambiguous states disrupt the decision system’s ability
to learn by promising erroneously large expected
returns. If we can detect perceptually ambiguous states
and actively lower their return estimates, we can minimize
their negative effects. Third, if the world is
deterministic, then perceptually ambiguous states will
periodically overestimate their true evaluations, whereas
the incidence of overestimation in unambiguous states
diminishes with time. Therefore, perceptually ambiguous
states can be detected by monitoring the sign of
the estimation error in the updating rule, Equation 5.

The new algorithm is outlined in Figure 6. The system
recognizes two classes of actions: overt actions and
perceptual actions. Overt actions change the state of
the external world whereas perceptual actions change
the mapping between world and the internal states.⁵
The main decision cycle is the overt cycle, which concerns
itself with choosing overt actions in an attempt
to maximize return. Embedded within the overt cycle
is a perceptual cycle. After each overt action, the system
executes a series of perceptual actions (the perceptual
cycle) in an attempt to assess the true state of the
external world. The objective of the perceptual cycle
is to find an internal state that unambiguously represents
the current world state. Upon completion, the
perceptual cycle returns a list of the internal states
encountered during the perceptual cycle, $S_t$. Each state
corresponds to a different view (representation) of the
current external world. The utility of the world state
at time $t$, $U_t$, is estimated as the maximum utility
of the individual internal states, $\max_{s \in S_t} U(s)$. As
will be described below, our algorithm for adjusting
the utility estimates of internal states severely lowers
the utilities of perceptually ambiguous states. Consequently,
utility estimates for world states tend to be
based on actual utilities of unambiguous states and not
biased by aberrational maxima associated with perceptually
ambiguous states. Once $U_t$ has been estimated,
the $Q$ estimates for the previous overt action are updated.
The overt cycle then continues by selecting an
overt action to execute. Some high fraction of the time
(90%) the system chooses the action consistent with
its policy; the rest of the time (10%) it chooses an action
at random. When following policy, the action is
chosen by searching among active internal states, $S_t$,
for the decision with the largest action-value. Once
an overt action is chosen, it is executed and the overt
cycle begins anew.


⁵As a side effect, overt actions may also change the
perceptual configuration, but perceptual actions are not
allowed to affect the world state. In the indexical
sensory-motor system described in Figure 2, the action-frame has
only overt actions, and the attention-frame has only
perceptual actions.

Figure 6: An outline of the steps executed by the new
decision system designed specifically to overcome the
difficulties caused by perceptual aliasing.


The standard Q-learning algorithm defines the
action-value of a decision as the return the system
expects to receive given that the system makes that
decision and follows its policy thereafter, as characterized
by Equation 3. However, for perceptually ambiguous
states this definition leads to artificially high
action-values (aberrational maxima). We have developed
a modified learning algorithm that is based on
Q-learning but incorporates a competitive component.
This component tends to suppress the action-values of
perceptually ambiguous states while allowing
action-values for unambiguous states to take on their nominal
values. The result is that utility estimates for world
states are more accurate since they are based on predictions
from unambiguous internal states and not on
perceptually ambiguous states.

Our modified learning algorithm is based on identifying
one internal state, among $S_t$, that takes the
lion’s share of the responsibility (credit or blame) for
the outcome of the current decision. We identify this
state as the lion. If the system is following its policy
then the lion is defined as: lion $= s_t$ such that
$Q(s_t, a_t) = \max_{s \in S_t,\, a \in A_O} Q(s, a)$. If instead the
system chooses a random action, $a_{random}$, the lion is
defined as: lion $= s_t$ such that $Q(s_t, a_{random}) =
\max_{s \in S_t} Q(s, a_{random})$. The idea underlying the use
of a lion is that the lion state should be an internal
state that unambiguously represents the current world
state, and once such a state is found it is used to direct
all actions associated with the world state it represents.

Perceptually ambiguous lions are detected and suppressed
as follows. If the action-value associated with
the lion, $Q(s_t, a_t)$, is greater than the estimated return
obtained after one step, $r_t + \gamma U(s_{t+1})$, then the lion
is suspected of being ambiguous and the action-value
associated with it is suppressed (e.g. reset to 0.0).
Actively reducing the action-values of lions that are
suspected of being ambiguous gives other (possibly
unambiguous) internal states an opportunity to become
lions. If the lion does not overestimate the return it is
updated using the standard 1-step Q-learning rule. To
prevent ambiguous states from climbing back into contention,
the estimates for non-lion states are updated
at a much lower learning rate and only in proportion
to the error in the lion’s estimate. The observation
that allows this algorithm to work is that ambiguous
states will eventually (one time or another) overestimate
action-values; consequently they will eventually
be suppressed. On the other hand, it can be shown
that an unambiguous lion is stable (i.e. will not
overestimate its action-value) if every state between the
lion’s state and the goal also has an unambiguous lion.
Thus, ambiguous states are unstable with respect to
lionhood, while unambiguous states eventually become
stable. The steps for updating action-values are shown
in Figure 6 under the Update-Overt-Q-Estimates
heading.
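
The lion mechanism described above can be sketched as follows. This is an illustration under assumptions, not a transcription of Figure 6: the function names, the dictionary representation of Q, the two learning rates, and the choice to update non-lion states only when the lion is not suppressed are all hypothetical.

```python
def select_lion(Q, S_t, actions, forced_action=None):
    """Pick the (internal state, overt action) pair with the largest
    action-value. If `forced_action` is given (the random-action case),
    only that action is considered."""
    candidates = list(actions) if forced_action is None else [forced_action]
    return max(((s, a) for s in S_t for a in candidates),
               key=lambda sa: Q.get(sa, 0.0))


def update_overt_q(Q, S_t, lion, action, r, U_next,
                   alpha=0.5, alpha_others=0.05, gamma=0.9):
    """Update action-values after one overt step.

    `U_next` is the utility estimated for the next world state, i.e. the
    maximum of U over the internal states seen in the perceptual cycle.
    """
    target = r + gamma * U_next
    error = target - Q.get((lion, action), 0.0)
    if error < 0:
        # The lion promised more than was obtained: suspect ambiguity
        # and suppress its action-value.
        Q[(lion, action)] = 0.0
        return
    # Otherwise apply the standard 1-step rule to the lion ...
    Q[(lion, action)] = Q.get((lion, action), 0.0) + alpha * error
    # ... and nudge the other active states at a much lower rate
    # (assumed here to happen only when the lion is not suppressed).
    for s in S_t:
        if s != lion:
            Q[(s, action)] = Q.get((s, action), 0.0) + alpha_others * error


# Usage: an over-optimistic state is the first lion and gets suppressed,
# which lets the other state take over on the next decision.
Q = {("ambiguous", "pick"): 5.0, ("clear", "pick"): 1.0}
S_t = ["ambiguous", "clear"]
first_lion = select_lion(Q, S_t, ["pick", "place"])
update_overt_q(Q, S_t, first_lion[0], first_lion[1], r=0.0, U_next=2.0)
second_lion = select_lion(Q, S_t, ["pick", "place"])
```

Resetting to 0.0 is the example suppression value given in the text; a gentler decay would fit the same structure.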

The steps in the perceptual cycle are sketched in
Figure 6 under the Perceptual Cycle heading. The
objective of the perceptual cycle is to accumulate a
set of internal representations of the external world,
one of which is unambiguous. This is achieved by
executing a series of perceptual actions. In our current
implementation, each perceptual cycle executes
a fixed number (n = 4) of perceptual actions. This
has proven adequate for our experiments; however, it
is easy to imagine variable length perceptual cycles, in
which the cycle terminates as soon as an unambiguous
internal state is found or is lengthened when ambiguous
states are encountered.⁶ The algorithm for selecting
actions within the perceptual cycle is similar to the
algorithm for choosing overt actions in the overt cycle.
The vast majority of the time (90%), the system follows
its policy and a small fraction of the time (10%),
a perceptual action is selected at random. When following
policy, the action selected is the perceptual action
$a_p$ such that $Q(s, a_p) = \max_{b \in A_P} Q(s, b)$, where
$s$ is the system’s current internal state. That is, the
policy calls for perceptual actions that lead to internal
states with maximal expected returns.

⁶Actually, it may be possible to eliminate the distinction
between the overt cycle and the perceptual cycle and
integrate them into a single cycle, in which the action (either
overt or perceptual) with the highest utility is chosen.
We are currently experimenting with such an algorithm.

Figure 7: A sequence of world states in a typical solution
path for the block manipulation task. Depending
upon the placement of the attention frame, states 1,
3, 4, 5, and 6 may be represented ambiguously.

The  rules  for  updating  action-values  for  perceptual 
actions  are  those  for  standard  1-step  Q-learning  and 
are  shown  in  Figure  6  within  the  Perceptual  Cy¬ 
cle  procedure.  These  updating  rules  lead  to  action- 
values  that  estimate  the  average  utility  of  the  states 
that  result  from  executing  a  perceptual  action.  Since 
unambiguous  states  tend  to  have  higher  utilities  than 
ambiguous  states  (which  are  suppressed),  the  effect  is 
to  choose  perceptual  actions  that  lead  to  unambiguous 
internal  states. 
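
A fixed-length perceptual cycle of this kind might look as follows. The sketch rests on several assumptions that the text does not settle: the `perceive(s, a)` interface, the inclusion of the starting internal state in the returned list, a zero reward for perceptual actions, and the use of the discounted utility of the resulting state as the 1-step target.

```python
import random


def perceptual_cycle(Q, U, s, perceive, perceptual_actions, n=4,
                     alpha=0.5, gamma=0.9, epsilon=0.1, rng=random):
    """Execute n perceptual actions and return the internal states seen.

    Q maps (internal state, perceptual action) to an action-value and
    U maps an internal state to its utility estimate. `perceive(s, a)`
    is an assumed interface returning the internal state that results
    from applying perceptual action `a`; the world itself is unchanged.
    """
    S_t = [s]
    for _ in range(n):
        if rng.random() < 1.0 - epsilon:
            # Follow policy: the perceptual action with the largest value.
            a = max(perceptual_actions, key=lambda b: Q.get((s, b), 0.0))
        else:
            a = rng.choice(perceptual_actions)
        s_next = perceive(s, a)
        # Standard 1-step update; perceptual actions earn no reward here.
        error = gamma * U.get(s_next, 0.0) - Q.get((s, a), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * error
        S_t.append(s_next)
        s = s_next
    return S_t


def world_utility(U, S_t):
    """Estimate the world-state utility as the best internal utility."""
    return max(U.get(s, 0.0) for s in S_t)
```

Because suppressed (ambiguous) states carry low utilities, perceptual actions that lead to them accumulate low action-values, so the policy drifts toward views that are unambiguous.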

5  An  Example 

To  test  our  ideas  we  have  implemented  a  system  that 
learns  a  very  simple  block  manipulation  task.  The  task 
goes as follows. The agent is presented with a pile of
blocks  on  a  conveyor  belt.  The  agent  can  manipulate 
the  pile  by  picking  and  placing  blocks.  When  the  agent 
arranges  the  blocks  in  certain  goal  configurations,  it 
receives  a  fixed  reward  of  5000  units.  Otherwise  it  re¬ 
ceives  no  reward.  When  the  agent  solves  the  puzzle 
the  pile  immediately  disappears  and  a  new  pile  comes 
down  the  belt.  If  the  agent  fails  to  solve  the  puzzle 
after 75 steps, the pile falls off the end of the conveyor
[...] any number of blocks in it and can be arranged in
arbitrary stacks. A block can be any one of three colors:
red,  green,  or  blue.  The  robot’s  sensory-motor  system 
is  the  indexical  system  described  in  Figure  2.  Also,  we 
make  the  standard  assumptions  that  a  block  can  only 
be  picked  up  if  its  top  is  clear  and  that  a  block  can 
only  be  placed  at  locations  with  clear  tops. 

The  particular  task  we  studied  rewards  the  agent 
whenever  it  picks  up  a  green  block.  That  is,  goal 
configurations consist of those states in which the
robot is holding a green block. We chose to study
this task because it is very simple but adequate to
demonstrate the difficulties caused by perceptual
ambiguity. These problems can be seen in Figure 7 which
shows  the  sequence  of  world  states  the  agent  traverses 
in  solving  one  instance  of  a  problem  whose  initial  state 
consists of a red block on a green block and a blue block
on  the  table.  Depending  on  the  placement  of  the  at¬ 
tention  frame,  world  states  1,3, 4, 5,  and  6  may  have 
ambiguous  internal  representations.  If  the  attention 
frame  is  fixed  on  the  green  block,  then  the  states  are 
unambiguously  represented.  However,  if  the  atten¬ 
tion  frame  is  fixed  on  the  blue  block  then  the  internal 
representation  of  the  states  is  ambiguous.  For  exam¬ 
ple,  in  state  6,  if  the  attention  frame  is  fixed  on  the 
blue  block  then  this  state  cannot  be  distinguished  from 
other  world  states  that  are  identical  except  with  addi¬ 
tional  blocks  above  the  green  block.  The  system  over¬ 
comes  ambiguities  by  learning  to  direct  its  attention 
frame  to  the  green  block,  which  provides  sufficient  ex¬ 
tra  information  (the  height  of  the  green  stack)  needed 
to  disambiguate  the  situation. 

Experiments  were  performed  to  obtain  quantitative 
data  on  the  new  decision  system’s  performance.  Each 
experiment  consisted  of  presenting  the  robot  with  a 
sequence  of  500  piles  of  4  randomly  configured  blocks, 
with  each  pile  containing  one  green  block.  The  robot’s 
performance  is  illustrated  in  Figure  8.  The  figure 
shows  the  number  of  steps  required  to  solve  the  prob¬ 
lem  (averaged  over  20  runs  of  500  piles)  as  a  function 
of  the  number  of  trials  seen.  Initially  the  number  of 
steps  required  is  high,  near  the  maximum  of  75,  since 
the  robot  thrashes  around  randomly  searching  for  re¬ 
inforcement.  However,  as  the  robot  begins  to  solve  a 
few problems its experience begins to accumulate and
it  develops  a  general  strategy  for  obtaining  reward.  By 
the  end  of  the  experiment,  the  time  required  to  solve 
the  problem  is  close  to  optimal.  The  system’s  perfor¬ 
mance  doesn’t  converge  to  optimal  since  10%  of  the 
time  it  chooses  a  random  action.  This  anomaly  re¬ 
flects  a  simplification  in  our  decision  algorithm  that 
can  be  eliminated  by  incorporating  a  slightly  more 
complex procedure for controlling exploration [Barto
et al., 1989].

6  Conclusions 

In this paper, we have considered the interactions
that arise in adaptive control architectures that integrate
active sensory-motor systems (specifically indexical
representations) with decision systems based on
[...] trivial since active sensory-motor systems naturally
lead to internal states that are perceptually ambiguous.
Perceptual ambiguity wreaks havoc on the decision
system’s ability to learn since it introduces aberrational
maxima in the evaluation function and destabilizes
the optimal policy. A solution to this problem
was proposed that is based on actively detecting and
suppressing ambiguous states. The result is a system
that learns to focus its attention on the relevant aspects
of the domain as well as control its overt behavior.
The new algorithm was demonstrated in a system
that learns a simple block manipulation task that is
beyond the scope of previous reinforcement learning
systems. Although our systems are still very primitive,
we find the results encouraging and are hopeful
that continued effort will yield systems capable of more
sophisticated behavior.

Figure 8: A plot of the average number of steps required
to solve the block manipulation task as a function
of the trials seen by the agent. (Horizontal axis:
number of trials.)

Abstract

Many  machine  learning  systems  learn  from 
their  problem  solving  experience  in  single¬ 
agent  domains.  This  paper  discusses  an 
algorithm  to  learn  complex,  relevant,  cost 
effective  plans  for  a  broad  class  of 
competitive,  multi-agent  domains.  Such  a 
plan,  called  a  fork,  is  extracted  from  the 
explanation  of  a  single  failure,  and  represents  a 
set  of  mutually  overlapping  simple  plans  to 
achieve  a  goal.  Each  plan  is  non-linear, 
provides  alternatives  for  contingencies,  is 
applicable  both  offensively  and  defensively, 
and  has  a  clear  upper  bound  for  its  execution 
time.  This  paper  describes  the  representation 
and  implementation  of  forks  in  HOYLE,  a 
system  to  learn  to  play  any  two-person,  perfect 
information  game  well.  Together  with  a  weak 
theory  for  its  general  domain,  HOYLE  has 
used  forks  to  learn  to  play  a  broad  variety  of 
games  perfectly  without  extensive  forward 
search  in  the  game  tree.  In  more  challenging 
games,  however,  the  selection  of  a  relevant 
plan  and  its  binding  to  the  current  game  state  is 
unacceptably  costly.  This  paper  details  the 
significantly  improved  performance  directly 
attributable  to  learning  about  appropriate 
forks,  and  the  heuristics  that  guard  against 
unacceptable  degradation  of  performance  as 
new  knowledge  is  acquired. 

1.  Introduction 

From  their  problem  solving  experiences,  many  machine 
learning  programs  extract  and  reformulate  information 
that  can  improve  their  performance  in  single-agent 
domains. Many of these programs (e.g., Fikes, Hart, &
Nilsson,  1972;  Korf,  1985;  Laird,  Rosenbloom,  & 
Newell,  1986)  have  learned  macro-operators  for 
planning.  Others  (e.g.,  Langley,  1985;  Minton,  1988; 
Mitchell, Utgoff, & Banerji, 1983) have learned heuristic
rules to improve their search performance in planning
domains.  Recent  research  in  explanation-based  learning 
(e.g.,  Minton,  1988;  Mooney,  1988)  has  demonstrated 
how  one  problem-solving  example  can  provide  relevant, 
cost  effective  knowledge  within  a  broad  domain.  In  a 
domain  with  more  than  one  agent,  however,  credit  and 
blame  assignment  is  more  difficult,  the  next  choice  point 
for  each  agent  is  not  absolutely  predictable,  and  the 
absolute  merit  of  goal  states  and  paths  to  them  is  difficult 
to  assess.  There  has  been  little  work  on  learning  useful 
plans  for  broad  domains  with  more  than  one  agent. 
Minton  (1984)  described  an  algorithm  that  learned 
recognition  rules  for  plans  in  a  two-agent  domain.  These 
rules,  however,  were  limited  to  chess  and  slowed  the 
system "dramatically."

This  paper  describes  an  algorithm  to  learn  a  class  of 
extremely  powerful  plans  in  a  domain  where  two  or  more 
adversarial  agents  perform  some  sequence  of 
unretractable,  possibly  interfering  actions  in  a  race  to 
achieve  some  goal.  The  algorithm  has  been  implemented 
within  HOYLE,  a  machine  learning  program  for  two- 
person,  perfect  information  games  (Epstein,  1989a).  The 
plans  are  called  forks;  they  are  game-independent, 
partially  ordered,  and  applicable  both  offensively  and 
defensively.  Each  plan  provides  alternatives  for 
contingencies  and  has  a  known  upper  bound  on  its 
execution  time.  Together  with  a  weak  theory  for  its 
general  domain,  HOYLE  has  previously  used  such 
preselected  forks  to  learn  to  play  a  broad  variety  of 
games  perfectly  without  extensive  forward  search  in  the 
game  tree.  In  more  challenging  games,  however,  the 
selection  of  a  relevant  plan  and  its  binding  to  the  current 
game  state  is  unacceptably  costly.  The  algorithm 
discussed  here  enables  HOYLE  to  learn  a  relevant  plan 
as  an  explanation  of  a  single  losing  experience,  and  to 
improve  its  performance  significantly.  Heuristics  guard 
against  unacceptable  degradation  of  performance  as  new 
knowledge is acquired, and support the correct offensive
and  defensive  application  of  these  plans. 

2.  An  Overview  of  HOYLE 

HOYLE  is  a  system  that  learns  to  play  a  broad  class  of 
two-person,  perfect  information  games  extremely  well. 
A two-person, perfect information game is one in which
all  information  about  the  game  is  disclosed  and  equally 
available  to  its  two  participants,  e.g.,  there  are  no  closed 
hands  and  no  uncertain  outcomes  as  with  dice.  Any  such 
game  can  be  represented  as  a  finite,  directed  graph  (the 
game graph) in which a node (state) both identifies the
participant whose turn it is (the mover) and describes a
possible arrangement of the material (the board and
playing pieces). For the games considered here, there are
exactly  two  participants:  Player  (the  one  who  moves 
first)  and  Opponent.  Player  and  Opponent  are  the  roles 
in  a  two-person  game.  During  a  contest  (a  path  in  a 
game  graph  that  represents  one  complete  experience  of 
the  game),  Player  tries  to  arrive  at  a  terminal  node 
categorized  as  a  win  and  Opponent  tries  to  arrive  at  a 
terminal  node  categorized  as  a  loss. 

Since  from  any  state  there  are  usually  many  legal 
moves,  the  challenge  in  play  is  for  the  mover  to  select  the 
best  move,  i.e.,  one  that  will  maximize  the  mover's 
opportunity  to  arrive  at  a  desired  result,  while 
minimizing  the  non-mover's  chances  to  arrive  at  her  own 
desired  result.  True  expertise  at  a  game  is  perfectly 
backed  up  knowledge  from  its  entire  game  graph.  Most 
interesting  games  have  extremely  large  game  graphs;  for 
them,  true  expertise  through  exhaustive  search  of  the 
game  graph  is  computationally  intractable.  Instead, 
computer  game  playing  programs  usually  approximate 
true  expertise  by  heuristic  search  in  the  game  graph. 
From  a  given  node,  such  a  program  estimates  the  unseen 
finishes  of  deeper  intermediate  nodes  and  backs  up  this 
approximate  knowledge  through  the  game  graph  to  the 
given  node  to  select  the  best  possible  legal  move. 
Human  experts  can  foil  such  programs  by  executing 
plans  whose  threat  lies  at  a  deeper  level  than  the 
programs' search.

HOYLE  accepts  as  input  a  representation  of  some 
game:  its  material,  interface,  and  rules.  The  interface 
provides  the  graphics  for  the  game  and  interactive 
communication  with  another  participant  through  the 
keyboard.  The  rules  describe  the  legal  moves,  when  a 
contest  at  the  game  terminates,  and  how  to  determine  the 
outcome  of  that  contest.  From  this  input  HOYLE  is  able 
to  play  any  such  game  correctly,  i.e.,  according  to  the 
rules;  HOYLE's  task  is  to  learn  to  play  it  perfectly. 
Without perfect knowledge, the performance of a
game-playing program must be evaluated empirically, by
playing  contests.  HOYLE  is  evaluated  against  two 
criteria:  how  well  it  learns  to  play  in  a  tournament  (a 
sequence  of  contests  where  the  participants  alternate 
roles)  against  a  human  expert  and  how  quickly  it 
achieves  that  skill.  Measures  of  skill  include  ability  to 
win, ability to draw, and the duration of contests played,
measured in average number of moves.

HOYLE  is  a  limitedly  rational  program  that  learns  from 
its  experience.  Compared  to  traditional  one-game 
programs  (e.g.,  Anantharaman,  Campbell,  &  Hsu,  1988; 
Berliner,  1980;  Berliner  &  Ebeling,  1989;  Lee  & 
Mahajan, 1988; Rosenbloom, 1982; Samuel, 1967;
Schaeffer,  1988),  HOYLE  does  relatively  little  forward 
search  into  the  game  tree,  has  a  limited  memory,  and  has 
no  heuristics  for  symmetry.  Instead,  HOYLE  makes 
decisions  based  on  a  variety  of  advice.  HOYLE's  thesis 
is  that,  in  learning  to  play  any  new  game,  a  person 
actually  has  relevant  prior  knowledge  about  it,  a  weak 
theory  for  the  general  domain  of  two-person,  perfect 
information  games.  This  weak  theory  encapsulates  the 
person's experience from contests at other games, and
enables  her  to  learn  to  play  any  such  game  well. 
HOYLE's  weak  theory  includes  a  set  of  valid,  but 
deliberately  narrow,  viewpoints  (HOYLE's  Advisors)  that 
are  modified  by  HOYLE's  experience,  i.e.,  learn.  When 
it  is  HOYLE's  turn  to  move,  each  Advisor  generates  one 
or  more  comments  about  the  current  state.  To  select  a 
move,  HOYLE  weighs  the  comments  carefully  against 
each  other  (Epstein,  1989b)  according  to  the  game,  the 
contest,  and  the  nature  of  the  participants.  Among 
HOYLE's  Advisors  is  Pitchfork,  a  procedure  that  exploits 
certain  very  powerful  plans,  called  forks.  The  way  this 
Advisor  learns  the  particular  forks  effective  in  a  given 
game  is  the  subject  of  this  paper. 

3.  An  Example  of  a  Fork  in  Execution 

Assume  a  domain  where  a  finite  set  of  actions  is 
available,  two  or  more  agents  alternately  take  a  single 
action,  and  no  action  may  be  taken  by  more  than  one 
agent.  A  simple  plan  in  such  a  domain  is  an  unordered 
set  of  unretractable  actions  that  achieve  a  goal. 
Informally,  a  fork  is  two  or  more  overlapping  simple 
plans  belonging  to  the  same  agent,  directed  toward 
different  goals,  and  having  one  or  more  common  actions. 
(Although  the  implementation  described  here  is  for 
games  and  two  agents,  the  algorithm  is  limited  only  by 
the  definitions  of  simple  plan  and  fork.)  Figure  1  shows 
an  example  of  a  fork  used  in  tic-tac-toe  where  Player's 
most recent move (action), in position 3, threatens a win
in both the first row and the third column (the goals).
Opponent is defenseless before this onslaught.

Figure 2: A Game State and Its Strategic Maps

Banerji  (1980)  devised  a  graph  representation  for  the 
mover's  situation  in  a  game  focused  upon  certain  patterns 
called  "kernels."  An  early  implementation  of  Banerji's 
ideas  (Koffman,  1968;  King,  1970)  used  a  few  simple 
kernels to suggest moves in several games, like
tic-tac-toe. HOYLE has continued this research; it extends
Banerji's  original  terminology,  defines  and  explains  the 
significance  of  a  fork's  depth,  explores  the  generation  and 
application  of  forks,  and  integrates  forks  with  other 
techniques  to  support  both  offensive  (goal-forwarding) 
and defensive (goal-inhibiting) play (Epstein, 1990).

4.  Definition  and  Representation  of  Forks 

In  a  task  with  only  unretractable  actions  (actions  from 
states that do not lie on a cycle in the search space), a
strategic  map  for  any  state  can  be  constructed  for  each 
agent.  A  strategic  map  for  a  state  and  agent  is  a  labeled, 
undirected  bipartite  graph  in  which  one  set  of  nodes  (the 
top  vertices)  is  labeled  with  the  actions  available  to  that 
agent from that state, the second set (the bottom vertices)
is labeled with the agent's simple plans to achieve her
goal, and an edge joins any action available to an agent
with each simple plan it forwards. Thus the strategic
map for an agent at a state represents all the
goal-forwarding options open to her. A strategic map may be
denoted as <T,B,E>, where T is a set of top vertices, B a
set of bottom vertices, and E a set of undirected edges
from  vertices  in  T  to  vertices  in  B.  In  Figure  2,  for 
example,  positions  2,  3,  4,  6,  7,  8,  and  9  are  available  to 
both  participants;  rows  1  and  3  and  columns  1  and  3  are 
simple  plans  for  Player;  and  rows  2  and  3,  columns  2 
and  3,  and  the  minor  (from  upper  right  to  lower  left) 
diagonal  are  simple  plans  for  Opponent.  Figure  2  shows 
the  strategic  maps  for  Player  and  Opponent  for  the  given 
board  position. 
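As a rough illustration of this definition, the following Python sketch builds the strategic maps for the position described above. The placement of Player on position 1 and Opponent on position 5 is an inference from the listed simple plans rather than something stated in the text, and the names `LINES` and `strategic_map` are hypothetical.

```python
# Tic-tac-toe lines, positions numbered 1..9 row by row (an assumed encoding).
LINES = {
    "Row 1": {1, 2, 3}, "Row 2": {4, 5, 6}, "Row 3": {7, 8, 9},
    "Column 1": {1, 4, 7}, "Column 2": {2, 5, 8}, "Column 3": {3, 6, 9},
    "Major diagonal": {1, 5, 9}, "Minor diagonal": {3, 5, 7},
}


def strategic_map(own, other, lines=LINES):
    """Build <T, B, E> for the agent holding `own` against the agent holding `other`."""
    # Top vertices: every position still available to the agent.
    free = set().union(*lines.values()) - own - other
    # Bottom vertices: a line remains a simple plan while the other agent holds none of it.
    B = {name for name, cells in lines.items() if not cells & other}
    # An edge joins each available position to every simple plan it forwards.
    E = {(pos, name) for name in B for pos in lines[name] & free}
    return free, B, E


# Assumed position: Player holds 1, Opponent holds 5.
player_map = strategic_map(own={1}, other={5})
opponent_map = strategic_map(own={5}, other={1})
```

Under these assumptions the bottom vertices come out as rows 1 and 3 and columns 1 and 3 for Player, and rows 2 and 3, columns 2 and 3, and the minor diagonal for Opponent, matching the description of Figure 2.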

Certain  connected  subgraphs  of  a  strategic  map,  induced 
by  a  subset  of  its  bottom  vertices,  represent  the 
overlapping  of  more  than  one  of  the  agent's  simple  plans. 
These  subgraphs  are  denoted  here  as  F-dn,  where  F 
stands  for  "fork,"  d  is  its  depth,  to  be  defined  shortly,  and 
n  provides  a  distinct  label.  For  example,  in  a  subgraph  of 
the  form  F-31  in  Figure  3(a),  either  action  represented 
by  a  degree-two  vertex  at  the  top  furthers  both  the  simple 
plans  represented  by  the  bottom  vertices  adjacent  to  it. 
Such  an  action,  on  the  left  for  example,  transforms  F-31 
into  Figure  3(b),  two  connected  graphs,  of  the  forms  F-1 
on  the  left  and  F-21  on  the  right.  (Immediately  before 
her  most  recent  move  into  position  3  in  Figure  1,  Player's 
strategic map was isomorphic to F-21, where the top
vertices  represented  2,  3  and  9  and  the  bottom  vertices 
represented  Row  1  and  Column  3.)  Subsequently  taking 
the action represented by the second degree-two position
transforms Figure 3(b) into Figure 3(c), three copies of
F-1. Thus there are two actions represented in F-31 that,
taken in any order, forward among them three simple
plans.

Formally,  a  fork  is  defined  to  be  either  F-1  = 
<{t},{b},{(t,b)}> with distinguished vertex (key) t, or an
undirected  bipartite  graph  containing  at  least  one 
distinguished  top  vertex  whose  removal,  along  with  its 
edges,  disconnects  the  graph  into  two  or  more  connected 
components,  each  of  which  is  also  a  fork.  The  set  of  all 
forks  is  completely  and  correctly  defined  recursively  as: 

•  F-1  is  a  fork. 

• The following construction produces a fork: Copy any
n  forks.  (They  are  the  parents  of  the  new  fork  under 
construction.)  Add  a  new  top  vertex  v.  (This  is  the  join 
of  the  new  fork  under  construction.)  Add  an  edge  from  v 
to  at  least  one  bottom  vertex  in  each  parent. 

•  Nothing  else  is  a  fork. 

Observe  that  removal  of  the  join  and  its  associated 
edges  transforms  the  graph  into  n  connected  graphs,  its 
parents,  each  of  which  is  itself  a  fork.  F-21,  for  example, 
is  the  result  of  joining  two  copies  of  F-1.  The  removal  of 
a  key  and  its  associated  edges  is  called  an  execution  of 
the  fork. 
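The recursive construction can be sketched in a few lines of Python. This is only an illustration of the definition, not HOYLE's implementation; the tuple representation `(T, B, E)` and the helper names `f1`, `join`, `execute`, and `count_components` are assumptions made for the example.

```python
from itertools import count

_fresh = count(1)  # supplies unique vertex labels (illustrative device)


def f1():
    """Return a fresh copy of F-1: one top vertex, one bottom vertex, one edge."""
    n = next(_fresh)
    t, b = f"t{n}", f"b{n}"
    return {t}, {b}, {(t, b)}


def join(parents, attach):
    """Join vertex-disjoint parent forks with a new top vertex v.

    attach[i] is the non-empty set of bottom vertices of parents[i]
    that v is connected to. Returns the new fork and its join vertex.
    """
    v = f"t{next(_fresh)}"
    T, B, E = {v}, set(), set()
    for (pT, pB, pE), chosen in zip(parents, attach):
        if not chosen or not chosen <= pB:
            raise ValueError("the join needs an edge into every parent")
        T |= pT
        B |= pB
        E |= pE | {(v, b) for b in chosen}
    return (T, B, E), v


def execute(fork, key):
    """Remove a key and its edges (an execution of the fork)."""
    T, B, E = fork
    return T - {key}, B, {(t, b) for t, b in E if t != key}


def count_components(fork):
    """Count connected components with a small union-find."""
    T, B, E = fork
    parent = {v: v for v in T | B}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for t, b in E:
        parent[find(t)] = find(b)
    return len({find(v) for v in parent})


# F-21 as the join of two copies of F-1.
a, b = f1(), f1()
f21, v = join([a, b], [a[1], b[1]])
```

Here `f21` has three top vertices and two bottom vertices, and `execute(f21, v)` falls apart into two components, each a copy of F-1, as the observation above requires.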

The  fork  F-1  has  depth  1,  and  for  d  >1,  a  fork  of  depth  d 
is  a  fork  such  that  removal  of  any  one  of  its  distinguished 
vertices  produces  at  least  one  connected  component  that 
is  itself  a  fork  of  depth  d-1,  and  d  is  maximal.  By  this 
definition,  for  example,  F-21  is  a  fork  of  depth  two  and 
F-31  is  a  fork  of  depth  three.  When  some  subset  of  the 
bottom  vertices  of  an  agent’s  strategic  map  induces  a 
copy  of  a  fork,  the  depth  of  that  fork  may  be  interpreted 
as  an  upper  bound  on  the  number  of  actions  until  the 
completion  of  one  of  those  goals  by  the  agent.  For 
example,  if  F-31  is  such  an  induced  subgraph,  the  agent 
is  certain  to  achieve  one  of  her  goals  after  three  actions. 
(Reaching  a  goal  state  sooner  is  possible,  of  course,  but 
against  optimal  defense,  success  will  require  three 
actions.)  The  keys  and  depth  of  a  fork  are  non-trivial 
computations,  however  (Epstein,  1990).  It  is  not  the 
case,  for  example,  that  every  fork  built  by  combining  two 
smaller  forks  takes  longer  to  execute  than  its 
components.  How  quickly  a  fork  can  achieve  its 
objective,  quantified  here  as  its  depth,  must  be  computed 
as  the  minimum  of  all  fork  depths  arising  in  the  map 
when  the  first  agent  takes  any  top  vertex  and  then  the 
other  agent  takes  another. 

In  summary,  a  fork  of  depth  d  represents  a  game- 
independent  set  of  mutually  overlapping  simple  plans  for 
a single agent to achieve a goal after at most d actions.
The  existence  of  at  least  one  key  in  every  fork  imposes  a 
partial  ordering  on  the  plan,  and  provides  alternative 
plans; the parent forks that appear as vertices are
removed.  The  power  of  a  fork  lies  in  the  ability  of  one 
action  to  forward  more  than  one  plan. 


5. The Application of Forks in Two-agent Domains

Because  each  agent  in  every  state  has  a  strategic  map, 
forks are applicable as both offensive and defensive
plans.  In  many  search  spaces,  plans  that  guarantee  the 
achievement  of  a  goal  state  with  d  actions  can  be 
delayed,  or  even  destroyed,  by  defensive  interference 
from  another  agent.  The  best  defense  against  any 
opposing  fork  is  to  take  the  action  in  one  of  its  keys 
yourself,  permanently  eliminating  many  of  the  other 
agent's  simple  plans.  For  example,  if  one  agent  has  a 
strategic  map  containing  F-31  from  Figure  3(a),  the  other 
agent can destroy the fork by taking the action in either
of  the  degree-two  keys  (say,  the  right  one),  immediately 
reducing  that  strategic  map  to  Figure  3(d).  In  what 
remains  of  the  first  agent's  strategic  map,  the  only 
remaining  simple  plan  is  readily  prevented  on  the  other 
agent's  next  turn.  For  an  extended  example  of  how  this 
technique  results  in  highly  skilled  play  see  (Epstein, 
1990). 

Strategic  maps  must  be  thoughtfully  applied  in  the  play 
of  two-person,  perfect  information  games.  If  the  mover 
has  more  than  one  fork  in  her  strategic  map,  the  best 
offensive  move  (action)  is  always  a  key  in  the  shallowest 
fork,  that  is,  pursuit  of  her  shortest  certain  plan  to  a  win. 
(If  F-1  is  detected,  this  reduces  to  the  common  sense 
theory  "If  you  see  a  winning  move,  take  it.")  If  one 
participant's strategic map contains a fork F of depth d, a
win in d turns is guaranteed, even when met by optimal
defense, but only when the second participant forwards
no  offensive  plans  of  her  own  and  it  is  the  first 
participant’s  turn.  If  the  second  pursues  a  shallower  fork 
of  her  own,  the  first  will  be  forced  repeatedly  to  defend 
against  it,  delaying  or  perhaps  even  losing  the  ability  to 
pursue  F  if  the  set  of  moves  is  small.  If  the  second 
participant  is  the  mover  and  plays  a  key  in  F,  the  first  will 
have  to  search  elsewhere  for  a  fork. 

The  application  of  forks  to  two-person,  perfect 
information  games  requires  extensive  computation,  both 
before  and  during  play.  For  a  machine  to  exploit  forks  to 
their full advantage, it would have to identify correctly
every  fork,  its  depth,  and  its  keys  in  both  strategic  maps. 
One  way  to  recognize  forks  is  to  maintain  a  library  of 
them, and search for subsets of bottom vertices in each
strategic  map  that  induce  a  subgraph  isomorphic  to  an 
element  of  the  library.  Implementation  of  the  recursive 
constructive definition of a fork to build such a library is
non-trivial,  however.  The  definition  generates  infinitely 
many  forks  at  any  given  depth,  can  generate  isomorphs 
of  the  same  fork,  and  provides  only  an  upper  bound  on 
the depth of the new fork. HOYLE's initial foray into
two-parent fork construction through depth four
demonstrated the anticipated combinatoric explosion.
From  the  single  fork  F-1  at  depth  one,  HOYLE  built  one 
fork  at  depth  two,  5  distinct  forks  at  depth  three,  and  520 
distinct  forks  at  depth  four.  Ideally,  for  any  given  contest 
state, HOYLE should construct its two strategic maps
and  then  search  for  forks  in  them.  Search  for  induced 
isomorphic  subgraphs  is  notoriously  expensive,  however, 
well beyond the resource allocation of a limitedly
rational  system  like  HOYLE.  Instead,  a  set  of  heuristics 
was  developed  for  HOYLE  to  capitalize  on  a  limited 
library  of  two-parent  forks,  both  offensively  and 
defensively. 

HOYLE's heuristics exploit the strategic map in a
variety  of  ways.  HOYLE  searches  only  for  two-parent 
forks  and  only  in  connected  components  of  a  strategic 
map. The program divides its allocated resources
approximately equally between offensive search of its
own  strategic  map  and  defensive  search  of  the  other 
participant's  strategic  map.  Search  heuristics  seek  the 
shallowest  forks  in  both  maps,  one  depth  value  at  a  time. 
Each  fork  is  screened  before  search  to  confirm  that  the 
number  of  its  top  vertices  does  not  exceed  the  current 
number of legal moves in the state and that no bottom
vertex  is  of  degree  larger  than  the  number  of  positions 
required  for  a  win.  Finally,  Advisors  other  than  Pitchfork 
have  access  to  the  strategic  maps  it  constructs;  they 
recommend moves based on properties of the strategic
maps and subgraphs of them. Further details and a
complete  algorithm  appear  in  (Epstein,  1990). 
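The screening step lends itself to a short sketch. The function below is one reading of the two tests described above, not the algorithm from (Epstein, 1990); the name `passes_screen` and the `(T, B, E)` tuple layout are assumptions.

```python
def passes_screen(fork, num_legal_moves, win_length):
    """Cheap pre-check before searching a strategic map for a library fork.

    fork is (T, B, E): top vertices, bottom vertices, and (top, bottom) edges.
    """
    T, B, E = fork
    # The fork cannot need more distinct actions than there are legal moves.
    if len(T) > num_legal_moves:
        return False
    # No simple plan can have more open positions than a win requires.
    degree = {b: 0 for b in B}
    for _, b in E:
        degree[b] += 1
    return all(d <= win_length for d in degree.values())


# F-21 written out by hand: two simple plans sharing one action.
F21 = (
    {"left", "shared", "right"},
    {"plan A", "plan B"},
    {("left", "plan A"), ("shared", "plan A"),
     ("shared", "plan B"), ("right", "plan B")},
)
```

For instance, `passes_screen(F21, num_legal_moves=7, win_length=3)` accepts the fork, while the same call with only two legal moves remaining rejects it without any subgraph search.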

For  the  simplest  games  in  HOYLE's  testbed,  this 
approach,  in  combination  with  the  other  Advisors  and  the 
control  structure,  was  able  to  learn  to  play  with  true 
expertise.  For  more  difficult  games,  particularly  those 
with larger boards, exhaustive search for induced
isomorphic  subgraphs  is  far  too  slow. 

6.  An  Algorithm  that  Learns  Forks 

When  it  is  HOYLE's  turn  to  move,  its  Advisor  Panic 
looks  to  see  if  the  non-mover  has  a  move  that  would  win 
the  game  (called  a  threat).  If  Panic  finds  any  threats,  it 
looks  for  moves  that  would  make  all  of  them  impossible 
on  the  non-mover's  next  turn.  In  the  event  that  Panic 
detects  a  threat  it  cannot  block,  it  correctly  tells  HOYLE 
to  resign,  because  it  is  overwhelmed.  This  resignation  is 
recorded  in  the  history,  the  trace  of  the  contest.  In  a 
positional  game  (one  like  tic-tac-toe  where  territory  is 
occupied  and  cannot  be  released),  a  resignation  by  Panic 
is  tantamount  to  an  admission  that  the  program  has  been 
forked.  HOYLE  takes  a  history  of  such  an  experience, 
and  backs  up  two  states  at  a  time  to  explain  its  defeat. 
This  procedure  uses  the  constructive  definition  of  fork  in 
Section  4  to  explain  how  the  loss  was  due  to  the 
execution  of  a  fork  and  precisely  which  fork  it  was.  The 
learned  fork,  together  with  the  heuristics  HOYLE  has  for 
applying  forks,  is  fully  operational. 
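Panic's check can be approximated for a positional game as follows. This is a minimal sketch under the assumption that a threat is a line on which the non-mover needs exactly one more position and the mover holds none; the function name `panic`, its return values, and the sample position are all hypothetical.

```python
TIC_TAC_TOE = [
    {1, 2, 3}, {4, 5, 6}, {7, 8, 9},   # rows
    {1, 4, 7}, {2, 5, 8}, {3, 6, 9},   # columns
    {1, 5, 9}, {3, 5, 7},              # diagonals
]


def panic(mine, theirs, lines):
    """Return ('no threat', None), ('block', position) or ('resign', None)."""
    threat_positions = set()
    for cells in lines:
        empty = cells - mine - theirs
        # The non-mover wins next turn if one position is open and the rest are hers.
        if len(empty) == 1 and not cells & mine:
            threat_positions |= empty
    if not threat_positions:
        return "no threat", None
    if len(threat_positions) == 1:
        # One move makes every threat impossible.
        return "block", next(iter(threat_positions))
    # Two or more winning positions cannot all be taken in a single move.
    return "resign", None


# Assumed position: Player holds 1, 3 and 6 and so threatens row 1 and
# column 3; the mover holds 4 and 8.
verdict = panic(mine={4, 8}, theirs={1, 3, 6}, lines=TIC_TAC_TOE)
```

In this assumed position the open positions 2 and 9 are both winning for the non-mover, so the sketch reports a resignation, which is the event that triggers the explanation procedure.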

Consider,  for  example,  the  partial  history  of  a  Qubic 
contest  HOYLE  lost  as  shown  in  Figure  4.  Qubic  is  a 
three-dimensional  version  of  tic-tac-toe  played  on  four 
parallel  four-by-four  grids,  where  a  win  is  four  in  any 
straight  line.  Board  positions  in  the  plane  shown  are 
referenced  here  by  matrix  coefficients;  the  other  planes, 
where  the  other  four  O's  lie,  are  omitted.  The  major 
diagonal,  from  the  upper  left  to  the  lower  right,  is 
denoted  as  D,  and  the  minor  diagonal  as  d.  In  Figure  4(a) 
it  is  HOYLE's  turn  to  move  as  Opponent  (O's),  but  the 
strategic  map  for  Player  (X's)  indicates  that  there  are 
threats  at  (2,1)  and  (3,4);  taking  Panic's  advice,  HOYLE 
resigns.  Now  HOYLE  constructs  an  explanation  of  its 
loss,  beginning  with  Player's  final  strategic  map  of  two 
simple  plans.  The  graph  in  Figure  4(a)  is  isomorphic  to 
two  copies  of  F-1.  Two  moves  backward  at  a  time, 
HOYLE  modifies  this  graph  until  the  history  is  exhausted 
(in  which  case  there  was  no  fork)  or  the  graph  is 
connected.  To  move  from  Figure  4(a)  to  4(b),  for 
example,  HOYLE  retracts  Player's  last  move,  (3,1),  and 
HOYLE's  response,  (2,2),  to  Player's  earlier  threat  on  the 
major diagonal. When HOYLE backs up, it adds to the
graph  as  top  vertices  any  moves  that  will  add  at  least  one 
edge from them to new simple plans of degree one or to
preexisting  simple  plans.  Thus  the  graph  in  Figure  4(b) 
is  a  copy  of  that  in  4(a),  with  the  addition  of  the  vertices 
(3,1)  and  (2,2),  the  edge  to  the  new  simple  plan  D,  and 
the  edges  to  the  preexisting  simple  plans  for  the  first 
column and the third row. The graph in Figure 4(b) is
isomorphic to a copy of F-1 and a copy of F-21.
Similarly,  the  graph  in  Figure  4(c),  isomorphic  to  a  copy 
of  F-31  and  a  copy  of  F-1,  is  before  Player's  move  in 
(3,3)  and  HOYLE's  (2,3)  block  of  the  threat  on  the  minor 
diagonal.  Finally,  the  graph  in  Figure  4(d)  is  the  result  of 
withdrawing  Player's  move  in  (3,2)  and  whatever  move 
HOYLE  had  made  immediately  before  it.  (That  move 
introduced  no  new  edges  to  the  graph.)  At  this  point, 
reconstruction  stops  because  the  graph  in  Figure  4(d)  is 
connected,  i.e.,  HOYLE  has  learned  a  fork  of  depth  4. 
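The backing-up procedure can be sketched as below. This is an interpretation of the walkthrough above rather than HOYLE's code. The move list is reconstructed from the text of the Figure 4 example, with the order of Player's four corner moves and the placeholders `o1` to `o4` for Opponent's moves in other planes being assumptions; `learn_fork`, `is_connected`, `PLANE_LINES`, and `HISTORY` are hypothetical names.

```python
def is_connected(T, B, E):
    """True when the bipartite graph <T, B, E> is non-empty and has one component."""
    nodes = {("top", t) for t in T} | {("bottom", b) for b in B}
    if not nodes:
        return False
    adj = {n: set() for n in nodes}
    for t, b in E:
        adj[("top", t)].add(("bottom", b))
        adj[("bottom", b)].add(("top", t))
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == nodes


def learn_fork(lines, history):
    """Explain a loss. history alternates moves and ends with the winner's last move."""
    winner, loser = set(history[-1::-2]), set(history[-2::-2])
    T, B, E = set(), set(), set()
    # Final strategic map: one copy of F-1 per outstanding threat.
    for name, cells in lines.items():
        empty = cells - winner - loser
        if len(empty) == 1 and not cells & loser:
            (key,) = empty
            T.add(key)
            B.add(name)
            E.add((key, name))
    moves = list(history)
    # Back up two moves at a time until the graph connects or history runs out.
    while moves and not is_connected(T, B, E):
        w = moves.pop()
        winner.discard(w)
        block = moves.pop() if moves else None
        if block is not None:
            loser.discard(block)
            # A forced block reveals a new simple plan of degree one.
            for name, cells in lines.items():
                if block in cells and cells - {block} <= winner:
                    T.add(block)
                    B.add(name)
                    E.add((block, name))
        # The winner's retracted move gains edges to simple plans already in the graph.
        new_edges = {(w, name) for name in B if w in lines[name]}
        if new_edges:
            T.add(w)
            E |= new_edges
    return (T, B, E) if is_connected(T, B, E) else None


# One 4-by-4 plane of Qubic: rows, columns, major diagonal D, minor diagonal d.
PLANE_LINES = {f"Row {r}": frozenset((r, c) for c in range(1, 5)) for r in range(1, 5)}
PLANE_LINES.update(
    {f"Column {c}": frozenset((r, c) for r in range(1, 5)) for c in range(1, 5)}
)
PLANE_LINES["D"] = frozenset((i, i) for i in range(1, 5))
PLANE_LINES["d"] = frozenset((i, 5 - i) for i in range(1, 5))

# Assumed move order; "o1".."o4" stand for Opponent's moves in other planes.
HISTORY = [(1, 1), "o1", (4, 4), "o2", (1, 4), "o3", (4, 1), "o4",
           (3, 2), (2, 3), (3, 3), (2, 2), (3, 1)]
```

On this reconstructed history, `learn_fork(PLANE_LINES, HISTORY)` stops after three backing-up steps with a connected graph over the simple plans Row 3, Column 1, D, and d, mirroring the progression from Figure 4(a) to 4(d). The sketch does not compute the fork's depth or its keys.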

Without  its  labels,  the  fork  in  Figure  4(d)  is  both  a 
generalization  of  HOYLE's  experience  in  a  contest  and  a 
specialization  of  its  recursive  concept  definition  for 
forks.  Empirically,  this  plan  has  been  found  quite  useful 
both  offensively  and  defensively  in  other  Qubic  contests. 
Because  it  is  game-independent,  it  could  be  relevant  to 
other  games  as  well. 

7.  Measuring  Learning  in  a  Tournament 

Learning  in  a  tournament  can  be  measured  by  changes  in 
performance. When a game is well-understood
mathematically,  the  cumulative  number  of  wins  and 
draws  for  true  expertise  can  be  plotted  as  a  function  of 
the  number  of  contests  played  (the  goal  curve).  In  a 
HOYLE  tournament,  the  cumulative  number  of  wins  and 
draws  for  the  program  is  calculated  as  a  function  of  the 
number  of  contests  played  (the  performance  curve). 
While  the  rate  of  change  in  the  distance  between  a 
performance  curve  and  the  goal  curve  is  decreasing,  the 
program  is  learning  to  play  better.  Once  its  performance 
curve  consistently  parallels  the  goal  curve,  HOYLE  may 
be  said  to  have  learned  to  play  with  true  expertise.  If  a 
program  loses  repeatedly,  but  is  able  to  prolong  its 
contests  as  time  passes,  it  is  learning;  a  program  that 
wins  or  draws  repeatedly  and  is  able  to  achieve  that  result 
more  quickly  as  time  passes  is  learning  too. 
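The two curves are simple cumulative counts, as the sketch below suggests. The outcome sequence is illustrative only: it assumes a ten-contest tournament with one initial loss followed by wins and draws in an arbitrary order, and a game in which true expertise never loses, so that the goal curve rises by one per contest. The names `performance_curve` and `gap` are hypothetical.

```python
from itertools import accumulate


def performance_curve(outcomes):
    """Cumulative number of wins and draws after each contest."""
    return list(accumulate(1 if o in ("win", "draw") else 0 for o in outcomes))


def gap(goal, performance):
    """Distance between the goal curve and the performance curve."""
    return [g - p for g, p in zip(goal, performance)]


# Illustrative outcomes; only the initial loss is taken as given.
outcomes = ["loss", "win", "win", "draw", "win", "draw", "win", "win", "draw", "win"]
goal = list(range(1, len(outcomes) + 1))  # assumes true expertise never loses
distance = gap(goal, performance_curve(outcomes))
```

With these assumed outcomes the distance stays constant at one after the first contest, which is what it means for the performance curve to parallel the goal curve.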

Qubic  is  a  challenging  game.  There  are  76  possible 
winning  configurations  to  guard  against,  and  well  over  a 
billion  possible  states.  There  is  an  average  of  32  possible 
legal  moves  from  any  state.  It  has  been  shown 
mathematically  (Paul,  1978)  that  every  contest  between 
two  participants  with  perfect  knowledge  should  end  in  a 
tie.  Forks  of  at  least  depth  five  are  constructable  in 
Qubic,  well  beyond  the  average  human's  notice,  and 
outside  Pitchfork's  standard  library  as  well. 
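The figure of 76 winning configurations can be checked by enumeration. The sketch below is an independent check written for this purpose, not part of HOYLE; the function name `winning_lines` and the zero-based coordinates are assumptions.

```python
from itertools import product


def winning_lines(n=4, dims=3):
    """Enumerate every straight line of n cells on an n^dims board."""
    lines = set()
    for start in product(range(n), repeat=dims):
        for step in product((-1, 0, 1), repeat=dims):
            if not any(step):
                continue  # the zero step is not a direction
            cells = [tuple(s + k * d for s, d in zip(start, step)) for k in range(n)]
            if all(0 <= c < n for cell in cells for c in cell):
                # frozenset removes the duplicate found from the line's other end
                lines.add(frozenset(cells))
    return lines


qubic_lines = winning_lines(4, 3)
```

The enumeration yields 76 lines for the 4-by-4-by-4 board and 8 for ordinary tic-tac-toe (`winning_lines(3, 2)`), in agreement with the closed form ((n+2)^d - n^d)/2.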

HOYLE has played several tournaments of 20 Qubic
contests each against an expert human, with a resource
limit  of  three  minutes  per  Advisor.  In  one  tournament 
(#1)  HOYLE  played  at  random,  making  legal  moves 
without  any  of  its  learning  facilities,  while  the  human 
pursued  any  simple  plan.  HOYLE  lost  every  contest 
very quickly, averaging 7.6 moves out of a possible 64, a
lower bound on performance. In another tournament
(#2),  the  human  expert  repeatedly  played  the  depth  four 
fork  of  Figure  4  against  the  full  original  implementation 
of  HOYLE  with  a  library  of  forks  only  through  depth 
three.  During  #2,  Pitchfork  struggled  to  identify  forks 
within  its  time  constraints  and  regularly  signaled  that  it 
was  not  allocated  sufficient  resources.  By  the  point  in  the 
contest  that  Pitchfork,  with  its  limited  resources,  detected 
an  offensive  fork,  it  was  too  late  to  defend  against  it,  and 
HOYLE  conceded,  still  losing  every  contest.  The 
average  contest  in  #2  was  19.4  moves,  however; 
HOYLE  offered  much  stronger  competition  than  mere 
random  play.  The  handicaps  of  limited  rationality  (both 
as  a  time  limit  and  as  a  fork  depth  limit)  and  lack  of 
knowledge  of  symmetry  were  too  great  for  this  version  of 
HOYLE  to  make  any  visible  progress  at  learning  Qubic. 
Against a set of strong game players who were not Qubic
experts,  however,  this  version  of  HOYLE  always  won. 

At  this  point  HOYLE  was  revised  as  follows.  Pitchfork 
was  modified  to  learn  forks,  as  described  here,  and 
instructed  to  consider  first  any  recently-learned  forks 
applicable  to  the  current  game.  Pitchfork  was  allocated 
five  minutes  of  computation  time.  (Each  Advisor  is 
otherwise  allocated  one  minute  and  typically  uses  only  a 
few  seconds.)  Pitchfork  was  told  to  recommend  any 
good  moves  found,  even  those  based  on  partial  search  of 
the  strategic  maps. 

During  initial  forays  at  Qubic,  this  modified  program 
learned  several  different  forks  of  depth  greater  than  three. 
At least one of these was more elaborate than the pattern
the  human  expert  had  in  mind.  The  extended  time 
allocation  slowed  some  of  the  early  and  non-forced 
moves, but still permitted real time play; no contest ran
more  than  an  hour. 

The  revised  version  of  HOYLE,  without  this  initial 
experience,  played  a  tournament  of  10  contests  (#3).  In 
the  first  contest  of  #3,  HOYLE  was  defeated  by  the 
successful  fork  from  Figure  4.  After  the  contest, 
however, Pitchfork analyzed the history and learned the
fork,  placing  it  first  on  Pitchfork's  search  list.  In 
subsequent  contests,  each  time  the  human  expert 
attempted  that  fork,  HOYLE  identified  and  blocked  it 
successfully.  Play  was  at  an  extremely  high  level 
throughout  #3;  contests  now  averaged  39.8  moves  and 
competition  was  intense.  In  the  fourth  contest,  for 
example,  both  participants  had  the  fork  from  Figure  4  in 
their  strategic  maps  (HOYLE's  was  a  version  using 
central spaces rather than the corners it had learned the
fork on). Neither participant was able to execute the fork
because of interference from the other, and the contest
was a draw. In the seventh contest, the human expert in
one state had two different isomorphs of the fork from
Figure  4.  HOYLE  was  able  to  block  these,  execute 
successfully  a  shallower  fork  of  its  own,  and  go  on  to  win 
the contest. HOYLE never lost a contest in #3 after the
first  one,  and  it  regularly  caught  the  human  expert  off 
guard,  winning  six  times. 

HOYLE  was  also  tested  on  a  3-by-3-by-3  version  of  tic- 
tac-toe,  a  somewhat  simpler  game  that  Player  should 
always  win.  The  original  program  learned  to  play 
perfectly after several contests. The fork-learning version
learned  to  play  with  true  expertise  just  as  quickly,  but  it 
also  learned  the  depth-four  fork  produced  by  the  opening 
move  in  the  post  mortem  for  the  first  contest.  (This  fork 
is  different  from  the  one  learned  in  Qubic.)  In  the 
remainder  of  the  tournament,  as  Opponent  HOYLE 
resigned  after  the  correct  opening;  as  Player  HOYLE 
declared  victory  before  it  made  a  single  move,  but  still 
went  on  to  play  perfectly  and  win  each  contest. 

8.  Results  and  Future  Work 

Previously,  a  few  sets  of  mutually  overlapping  simple 
plans  with  very  limited  applicability  were  identified  as 
naive  offensive  strategies  in  two-agent  domains. 
HOYLE's  game-independent,  graph  representation  for 
forks  includes  these  plans  and  provides  a  plan  generator. 
Each  fork  implicitly  contains  both  conjunctive  and 
disjunctive  descriptions.  Proper  application  of  the  seven 
smallest  forks  has  been  shown  empirically  to  produce 
high-quality  solutions  and  to  focus  attention  on  strategic 
possibilities  not  otherwise  likely  to  be  found  in  some 
large  search  spaces. 

For relatively easy games, simple heuristics to construct,
recall, and search for these seven forks in a contest will
support  learning  to  play  perfectly.  For  more  difficult 
games,  however,  high  match  cost  and  utility  (Minton, 
1988)  become  issues.  Empirical  work  with  HOYLE  thus 
far  indicates  that,  perhaps  by  the  topology  of  their  boards 
and  the  nature  of  their  rules,  most  games  intrinsically 
have  a  limited  number  of  applicable  forks.  Each  larger 
fork  relevant  to  a  particular  game  is  readily  identified 
from  an  explanation  of  the  trace  of  a  defeat  at  that  game. 
Searching  first  for  these  learned  forks  in  subsequent 
contests  of  the  same  game  suffices  to  simulate  expert 
play. 

The  learned  plans  are  generalizations  potentially 
applicable  to  any  game,  with  a  well-defined  metric 
(depth)  that  supports  their  error-free  application  both 
offensively  and  defensively  without  forward  search  into 
the  game  graph,  and  an  implicit  game-independent 
execution  plan  that  provides  for  contingencies.  Because 
heuristics  are  used  in  the  implementation,  performance 
may be imperfect. In its current game testbed, however,
HOYLE's ability to learn plans from observing its defeat
at  the  hands  of  an  expert  rapidly  enables  it  to  learn  to 
win.  Finally,  although  the  abstractions  representing  forks 
were designed for a particular class of games, they are
clearly  extendible  to  any  multi-agent  competitive 
planning  domain. 

Future  research  includes  the  extension  of  this  work  to 
contests  where  the  winner  deliberately  deflects  attention 
from a fork being executed, to games where moves are
retractable,  and  to  games  where  the  goal  is  disabling 
playing  pieces  rather  than  achieving  a  specific 
configuration. 

These  results,  while  promising,  raise  an  interesting 
issue:  at  the  current  time,  HOYLE  is  not  an  aggressive 
player,  merely  an  opportunistic  one.  On  defense, 
HOYLE  detects  and  blocks  the  forks  it  knows,  within  its 
heuristic constraints. On offense, however, HOYLE does
not  actively  plan  to  construct  a  state  in  which  it  will  have 
a  fork;  it  only  executes  forks  that  happen  to  lie  in  a  state. 
This  opportunistic  behavior  is  not  aggressive  play,  or 
even  aggressive  planning.  Current  research  includes 
near  forks,  aggressive  plans  that  give  rise  to  executable 
forks.  Until  HOYLE  takes  the  construction  of  a  fork, 
rather  than  its  execution,  as  a  goal,  the  program  will 
continue  to  play  excellent  defense  but  only  serendipitous 
offense. 

Abstract


Given  an  adequate  simulation  model  of  the 
task  environment  and  payoff  function  that 
measures  the  quality  of  partially  successful 
plans,  competition-based  heuristics  such  as 
genetic algorithms can develop high performance
reactive rules for interesting sequential
decision tasks. We have previously
described an implemented system, called
Samuel, for learning reactive plans and
have shown that the system can successfully
learn rules for a laboratory scale tactical
problem. In this paper, we describe a
method for deriving explanations to justify
the success of such empirically derived rule
sets.  The  method  consists  of  inferring 
plausible  subgoals  and  then  explaining  how 
the  reactive  rules  trigger  a  sequence  of 
actions  (i.e.,  a  strategy)  to  satisfy  the 
subgoals. 


1  Introduction 

This report is part of an on-going study concerning
learning reactive plans for sequential decision
tasks given a simulation of the task environment. In
particular, we have been investigating techniques that
allow a learning system to actively explore alternative
behaviors in simulation, and to construct high performance
rules from this experience using competition-based
methods. Our current research focuses on
learning reactive rules for a variety of tactical
scenarios. Learning tactical rules is especially
difficult if the environment is only partially modeled,
contains other independent agents, or permits only
limited sensing of important state variables. Such
features reduce the utility of traditional projective
problem solving (Mitchell, 1983; Minton et al., 1989)
and favor the use of reactive control rules that
respond to current information and suggest useful
actions (Agre and Chapman, 1987; Schoppers, 1987).


We have been investigating the usefulness of genetic
algorithms and other competition-based heuristics
(Grefenstette, 1988) to learn high performance reactive
rules in the absence of a strong domain theory.
The approach has been implemented in a system
called Samuel (Grefenstette, 1989). One of the
important differences between Samuel and many
other genetic learning systems is that Samuel learns
rules expressed in a high level rule language. The use
of  a  symbolic  rule  language  is  intended  to  facilitate 
the  incorporation  of  more  powerful  learning  methods 
into  the  system  where  appropriate.  In  this  paper,  we 
investigate  the  use  of  explanation-based  learning 
methods  to  explain  the  success  of  the  empirically 
learned  plans  found  by  the  genetic  learning  system, 
and  to  suggest  possible  improvements. 

Samuel consists of three major components:
a problem specific module, a performance module,
and a learning module. The problem specific module
consists of the task environment simulation, or world
model, and its interfaces. The performance module
consists of a competition-based production system
that performs matching, conflict resolution and credit
assignment. The learning module uses a genetic
algorithm to develop high performance reactive
plans, each plan expressed as a set of condition-action
rules. Each plan is evaluated by testing its performance
in controlling the world model through the
performance module. Genetic operators, such as
crossover and mutation, produce plausible new plans
from high performance precursors.

Experiments have shown that Samuel learns
highly effective reactive plans for laboratory scale
tactical problems (Grefenstette, 1989). However,
even though the individual rules of a plan can be
interpreted, the strategy underlying the plan is often
not apparent. We are currently expanding our focus
to  include  the  derivation  of  explanations  of  Samuel’s 
reactive  rules.  These  explanations  are  expected  to 
clarify  the  system’s  performance  to  system  users  as 
well  as  to  generate  new  reactive  rules  for  Samuel. 


Explanations  of  Empirically  Derived  Reactive  Plans  199 


In this paper, we first discuss a simulated
environment  to  which  Samuel  has  been  successfully 
applied.  The  remainder  of  the  paper  is  devoted  to 
describing  our  research  on  the  topic  of  generating 
explanations  of  reactive  plans. 

This work is part of an on-going study of
genetic algorithms for learning tactical plans. The
current system is detailed in (Grefenstette, Ramsey &
Schultz, 1990). An analysis of the credit assignment
methods in Samuel appears in (Grefenstette, 1988). A study
of the effects of sensor noise on Samuel appears in (Schultz,
Ramsey & Grefenstette, 1990).

2  The  Evasive  Maneuvers  Problem 

We have tested Samuel initially in the context
of a particular task called Evasive Maneuvers
(EM),  inspired  in  part  by  (Erickson  and  Zytkow, 
1988).  In  the  EM  simulation,  there  are  two  objects  of 
interest,  a  plane  and  a  missile,  which  maneuver  in  a 
two-dimensional  world.  The  object  is  to  control  the 
turning  rate  of  the  plane  to  avoid  being  hit  by  the 
approaching  missile.  The  missile  tracks  the  motion 
of  the  plane  and  steers  toward  the  plane’s  anticipated 
position.  The  initial  speed  of  the  missile  is  greater 
than  that  of  the  plane,  but  the  missile  loses  speed  as  it 
maneuvers.  If  the  missile  speed  drops  below  some 
threshold,  it  loses  maneuverability  and  drops  out  of 
the  sky.  It  is  assumed  that  the  plane  is  more 
maneuverable  than  the  missile,  that  is,  the  plane  has  a 
smaller  turning  radius. 

There exist six sensors that provide information
about the current tactical state:


1)  last-turn:  the  current  turning  rate  of  the  plane. 
This  sensor  can  assume  nine  values,  ranging  from 
-180  degrees  to  180  degrees  in  45  degree  increments. 

2)  time:  a  clock  that  indicates  time  since  detection  of 
the  missile.  Assumes  integer  values  between  0  and 
19. 

3)  range:  the  missile’s  current  distance  from  the 
plane.  Assumes  values  from  0  to  1500  in  increments 
of  100. 

4) bearing: the direction from the plane to the missile.
Assumes integer values from 1 to 12. The bearing
is expressed in “clock terminology”, in which 12
o'clock denotes dead ahead of the plane, and 6
o'clock denotes directly behind the plane.

5) heading: the missile’s direction relative to the
plane. Assumes values from 0 to 350 in increments
of 10 degrees. A heading of 0 indicates that the missile
is aimed directly at the plane’s current position,
whereas a heading of 180 means the missile is aimed
directly away from the plane.

6) speed: the missile’s current speed measured relative
to the ground. Assumes values from 0 to 1000 in
increments of 50.

In addition to the sensors, there is one control
variable, namely, the plane’s turning-rate. Turning-rate
has nine possible values, between -180 and 180
degrees in 45 degree increments. The learning objective
is to develop a set of decision rules that map
current sensor readings into actions that successfully
evade the missile whenever possible. The rule condition
contains sensor ranges (which may be cyclic),
and the action specifies a setting for the control variable.
An example of an actual decision rule learned
by Samuel is the following:

RULE  16: 

IF (and (last-turn [-135, 135]) (time [2, 12])
        (range [0, 700]) (bearing [2, 6])
        (heading [0, 30]) (speed [100, 950]))

THEN  (turn  90) 

STRENGTH  949 
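
How the condition part of such a rule might be tested against the current sensor readings can be sketched as follows. This is only an illustration and not Samuel's implementation: the names `in_range`, `rule_matches` and `CYCLIC`, the choice of bearing and heading as the cyclic sensors, and the convention that a cyclic range is written with its lower bound greater than its upper bound are all assumptions made for the example.

```python
# Hypothetical sketch of condition matching for one Samuel-style rule.
# Assumption: on a cyclic sensor, a range [lo, hi] with lo > hi wraps around
# (e.g. bearing [11, 2] would cover 11, 12, 1 and 2).

CYCLIC = {"bearing", "heading"}  # assumed set of cyclic sensors

RULE_16 = {
    "conditions": {
        "last-turn": (-135, 135),
        "time": (2, 12),
        "range": (0, 700),
        "bearing": (2, 6),
        "heading": (0, 30),
        "speed": (100, 950),
    },
    "action": ("turn", 90),
    "strength": 949,
}


def in_range(sensor, value, lo, hi):
    """Return True if value lies in [lo, hi], wrapping on cyclic sensors."""
    if lo <= hi:
        return lo <= value <= hi
    if sensor in CYCLIC:
        return value >= lo or value <= hi
    return False


def rule_matches(rule, readings):
    """A rule matches when every sensor condition holds for the readings."""
    return all(
        in_range(sensor, readings[sensor], lo, hi)
        for sensor, (lo, hi) in rule["conditions"].items()
    )


# Illustrative sensor readings (not taken from the paper).
readings = {"last-turn": 0, "time": 5, "range": 600,
            "bearing": 4, "heading": 10, "speed": 500}
print(rule_matches(RULE_16, readings))
```

With these illustrative readings every condition of RULE 16 holds, so the rule would be a candidate to fire with action (turn 90); conflict resolution among several matching rules is not modeled here.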

The EM process is divided into episodes that
begin with the missile approaching the plane from a
randomly chosen direction and that end when either
the plane is hit or the missile velocity falls below a
given threshold. The critic module provides numeric
feedback at the end of each episode that measures the
extent to which the missile has been successfully
evaded. In the case of unsuccessful evasion, partial
credit is given reflecting the plane’s survival time
(see (Grefenstette et al., 1990)). Each decision rule is
assigned a numeric strength that serves as a prediction
of the rule’s utility. The system uses incremental
credit assignment methods (Grefenstette, 1988) to
update the rule strengths based on feedback from the
critic received at the end of the episode. Experiments
have shown that Samuel can learn high-performance
rule sets (plans) for this task (Grefenstette, 1989).

As can be seen from the above example, while
the rules are individually understandable, the underlying
strategy behind the rules is not usually clear from
inspection.  On  the  other  hand,  a  person  who  watches 
a  display  of  the  EM  task  under  the  control  of  the 
learned  rules  can  usually  describe  the  strategy  being 
followed  in  conceptual  terms,  for  example: 

Get the missile directly behind the plane, let it
get fairly close, then make a hard left turn.

Once such a description has been obtained,
qualitative reasoning can be applied to explain and
justify the strategy. It is expected that explanation-based
methods will help to explicate the higher-level
strategies being learned, making the results of the
empirical learning more easily accepted by human
operators and, ultimately, expediting the learning process
itself. The remainder of the paper offers initial
steps in this direction.

3  Explaining  Empirically  Derived  Rules 

Our  approach  to  applying  explanation-based 
techniques  to  reactive  plans  can  be  divided  into  four 
phases: 

(1)  inferring  plausible  subgoals; 

(2)  confirming  subgoal  satisfaction; 

(3)  creating  explanations  for  reactive  plans;  and 

(4)  deriving  new  rules. 

The  following  sections  elaborate  our  approach  to 
each  of  the  first  three  phases.  The  fourth  phase  is 
outlined  under  our  plans  for  future  research. 

3.1  Inferring  Plausible  Subgoals 

Prior to deriving explanations that Samuel’s
actions are intended to satisfy particular subgoals, the
system first attempts to derive plausible subgoals,
such as “increase range to missile” or “increase missile
deceleration” from a trace of the behavior of the
system under the control of the learned rules. A trace
covering the actions occurring over a single episode
is examined. Traces consist of snapshots of sensor
readings followed by the decision rule that has fired.
Each snapshot is associated with a time, or state. An
example of a trace is shown in Figure 1, where
“lturn”, “brng”, and “hdng” are abbreviations for
last turn, bearing, and heading. The action is the turn
taken by the plane at this time. In order to simplify
the trace shown here, the decision rules do not
appear.

A domain theory has been developed for
automating subgoal derivation. This part of the
domain theory consists of plausible subgoal derivation
(PSD) rules such as the following:

PSD 1: IF range(m) > RANGE1
THEN PLAUSIBLE-SUBGOAL
(INCREASING deceleration(m))

PSD 2: IF range(m) < RANGE2
THEN PLAUSIBLE-SUBGOAL
(INCREASING range(m))

where RANGE1 and RANGE2 are user-definable
parameters and m represents the missile. The trace is
examined to find the first time at which a PSD rule
precondition, such as “range(m) > RANGE1”, holds.

lturn  time  range  brng  hdng  speed  action
    0     0   1000     7     0    700       0
  ...
  -90    11    600     5     0    100      45
  -45    12    700     4     0    100      45

Fig. 1. Example execution trace.

The algorithm for finding plausible subgoals
is the following:

PSD ALGORITHM: Find the set of all time intervals
in the execution trace of an episode for which the
sensor values satisfy the PSD rule condition during
that interval. This set, called the trigger set, consists
of situations that would plausibly trigger the implementation
of a strategy to satisfy the subgoal
specified in the PSD rule.

In the example trace above, if RANGE1 were
set to 900, then there is one time interval (of length
one unit) that satisfies the condition for PSD 1. This
interval is [0,0]; therefore, the trigger set is simply
{ [0,0] }. Since PSD 1 is satisfied, its subgoal,
namely, “(INCREASING deceleration(m))”, is proposed
as a candidate subgoal.
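
The PSD algorithm can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' code: the function names `trigger_set` and `psd1`, the dictionary representation of a snapshot, and the range value at time 2 are hypothetical; only the ranges at times 0 and 1 and the value RANGE1 = 900 follow the example in the text.

```python
# Hypothetical sketch of the PSD algorithm: collect the maximal time
# intervals over which a PSD rule precondition holds in an execution trace.

RANGE1 = 900  # user-definable parameter; 900 is the value used in the example


def trigger_set(trace, precondition):
    """Return a list of [start, end] intervals during which precondition holds.

    trace: list of snapshots (dicts with at least a "time" key), in time order.
    Adjacent snapshots are treated as contiguous in time.
    """
    intervals = []
    start = prev = None
    for snap in trace:
        t = snap["time"]
        if precondition(snap):
            if start is None:
                start = t          # a new interval opens here
            prev = t
        elif start is not None:
            intervals.append([start, prev])  # the open interval closes
            start = None
    if start is not None:
        intervals.append([start, prev])
    return intervals


def psd1(snap):
    """Precondition of PSD 1: range(m) > RANGE1."""
    return snap["range"] > RANGE1


# Illustrative trace fragment; only time and range are needed here.
trace = [
    {"time": 0, "range": 1000},
    {"time": 1, "range": 600},
    {"time": 2, "range": 500},   # assumed value, for illustration only
]
print(trigger_set(trace, psd1))  # [[0, 0]]
```

Returning maximal intervals rather than individual times keeps the result in the same form as the trigger set { [0,0] } described above.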

Once  a  plausible  subgoal  is  found,  the  next 
task is to determine whether the subgoal has been
satisfied.  Satisfaction  is  determined  by  applying  the 
confirmation  procedure  described  in  the  next  section 
for  time  intervals  in  the  trigger  set  until  either  the  set 
of  intervals  is  exhausted  or  the  subgoal  has  been 
confirmed. 

3.2 Confirming Subgoal Satisfaction

Subgoal satisfaction is determined by once
again scanning the execution trace. Scanning begins
at the time in the trace following a time interval from
the trigger set. Subgoal confirmation requires an
additional domain theory. In this case, Samuel’s
decision rule language is extended to capture further
information from the trace. For example, the system
extracts from the trace information about the change
in sensor values over time. The speed or range of the
missile, for instance, may increase from one state to
the next. By scanning the trace over multiple states,
the system derives acceleration and range increase
information for confirming subgoal satisfaction.

The confirmation of subgoal satisfaction
begins when a time interval is chosen from the trigger
set. In the current implementation, the user defines a
window  over  which  the  subgoal  satisfaction  check  is 
executed.  The  window  begins  at  a  user-defined  time 
that  is  after  the  trigger  set  time  interval.  Continuing 
with  the  example  above,  suppose  the  system  must 
confirm  that  the  increasing  missile  deceleration  goal 
has  been  achieved  over  the  time  window  that  extends 
from  time  1  to  time  3.  Then  the  change  in  missile 
speed  over  this  interval  is  checked  to  be  certain  that 
missile  deceleration  is  increasing.  The  deceleration  is 
increasing  from  100  to  150  over  this  time  interval. 
Therefore,  subgoal  satisfaction  has  been  confirmed. 
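
The confirmation check for an “increasing deceleration” subgoal can be sketched as below. This is a hedged illustration rather than the implemented procedure: deceleration is assumed to be the speed lost between consecutive snapshots, the function names are hypothetical, and the speed values are invented so that the decelerations come out as 100 and 150, matching the figures quoted in the text.

```python
# Hypothetical sketch of subgoal confirmation over a user-defined window.
# Assumption: deceleration = speed lost between consecutive snapshots.

def decelerations(trace, start, end):
    """Speed lost between consecutive snapshots with start <= time <= end."""
    speeds = [s["speed"] for s in trace if start <= s["time"] <= end]
    return [a - b for a, b in zip(speeds, speeds[1:])]


def increasing(values):
    """True if there are at least two values and each exceeds the previous."""
    return len(values) >= 2 and all(b > a for a, b in zip(values, values[1:]))


def confirm_increasing_deceleration(trace, start, end):
    """Confirm the subgoal (INCREASING deceleration(m)) over [start, end]."""
    return increasing(decelerations(trace, start, end))


# Invented speeds chosen to reproduce the 100 -> 150 example in the text.
trace = [
    {"time": 0, "speed": 700},
    {"time": 1, "speed": 650},
    {"time": 2, "speed": 550},
    {"time": 3, "speed": 400},
]
print(decelerations(trace, 1, 3))                    # [100, 150]
print(confirm_increasing_deceleration(trace, 1, 3))  # True
```

The window bounds (here 1 and 3) correspond to the user-defined parameters mentioned above.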

Once  subgoals  have  been  derived  and 
confirmed,  explanations  may  be  generated  to  justify 
the  observed  behavior.  The  next  section  describes  the 
process  of  explanation  generation. 

3.3 Creating Explanations

After  deriving  plausible  subgoals  and 
confirming  that  they  are  satisfied,  explanations  may 
be  formed  which  prove  that  sequences  of  Samuel’s 
decision  rules  satisfy  the  subgoals.  Explaining  failure 
to  satisfy  subgoals  is  presented  as  future  work. 

Creating  justifications  for  successful  subgoal 
satisfaction  requires  the  development  of  a  domain 
theory  that  captures  important  results  of  particular 
actions.  We  are  adapting  Forbus’s  Qualitative  Process 
Theory  (Forbus,  1984)  for  the  interpretation  of  the 
empirically  derived  rules  similarly  to  the  way  this 
theory  is  adapted  in  (Gervasio,  1989).  Qualitative 
Process  Theory  (QP  Theory)  expresses  common 
sense  notions  about  qualitative  relationships  between 
objects. 


We are currently using QP Theory to define
processes relevant to EM. A process is defined in
(Forbus, 1984) as something that acts through time to
change the parameters of objects in a situation.
Example processes are fluid and heat flow, boiling,
and motion. We define an EM process below. The
individuals are the objects on which the process acts.
The quantity conditions are inequalities regarding the
quantities of individuals that can be predicted solely
within dynamics. Preconditions are conditions that
must hold during the process but which need not be
predictable using dynamics. Relations are statements
that are true during the process. A process is active
whenever its preconditions and quantity conditions
hold. The Q+/Q- relations define qualitative proportionalities.
(Q+ X, Y) means that parameter X is
directly proportional to parameter Y. (Q- X, Y) means
that X and Y are inversely proportional.

process missile-evasion (p, m)

Individuals:
p, a plane
m, a missile

Quantity Conditions:
speed(p) > 0
speed(m) > 0

Preconditions:
range(m) > 0

Relations:
(Q+ deceleration(m), turning-rate(m))
(Q+ turning-rate(m), turning-rate(p))
(Q- speed(p), turning-rate(p))
(Q- turning-rate(m), range(m))

The above process description is incomplete
and is not entirely accurate. Since we do not intend to
engineer a complete and perfect domain theory, our
system will eventually possess a capability to diagnose
errors in its domain theory.

Once a partial domain theory exists, it is possible
to create plausible explanations of the events
that occurred during an EM episode. Explanations
are derived by creating proofs using the process relations
similarly to (Gervasio, 1989). The proof begins
with an observable but noncontrollable subgoal and
terminates when a change in a controllable parameter
has been found that is believed to have caused
subgoal satisfaction. The body of the proof consists
of QP Theory relational rules, such as those presented
above. For example, the following proof explains
how the increasing turning-rate of the plane eventually
causes the missile deceleration to increase.

(EXPLANATION
  (INCREASING deceleration(m))
  ((Q+ deceleration(m) turning-rate(m))
   (Q+ turning-rate(m) turning-rate(p))
   (INCREASING turning-rate(p))))


The above proof has terminated with a statement
that the plane turning rate is increasing. (The
plane turning rate is currently the only controllable
parameter.) The increasing turning rate is
hypothesized as having initiated a strategy to achieve
subgoal satisfaction. The system next verifies (by
examining the execution trace) that this behavior has,
in fact, occurred. For the above example, this would
consist of a check to be certain that the plane turning
rate is increasing during the time period that begins
during the trigger set time interval and ends at some
user-specified time following this interval. In the
example trace above, the condition that the turning
rate must be increasing would be satisfied if the
plane’s actions were examined from time 0 to time 1.
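
The construction of such a proof can be sketched as a backward search over the process relations. The sketch below is an assumption-laden illustration, not the system's actual prover: the function `explain`, the tuple encoding of relations, and the depth-first search order are hypothetical; the relations and the controllable parameter are those given in the text.

```python
# Hypothetical sketch: chain Q+/Q- relations backward from a subgoal until a
# change in a controllable parameter is reached.

RELATIONS = [
    ("Q+", "deceleration(m)", "turning-rate(m)"),
    ("Q+", "turning-rate(m)", "turning-rate(p)"),
    ("Q-", "speed(p)", "turning-rate(p)"),
    ("Q-", "turning-rate(m)", "range(m)"),
]
CONTROLLABLE = {"turning-rate(p)"}
FLIP = {"INCREASING": "DECREASING", "DECREASING": "INCREASING"}


def explain(direction, param, visited=frozenset()):
    """Return a list of proof steps ending in a change to a controllable
    parameter, or None if no such chain exists."""
    if param in CONTROLLABLE:
        return [(direction, param)]
    if param in visited:          # guard against cyclic relation chains
        return None
    for sign, x, y in RELATIONS:
        if x != param:
            continue
        # Q+ preserves the direction of change; Q- reverses it.
        needed = direction if sign == "Q+" else FLIP[direction]
        rest = explain(needed, y, visited | {param})
        if rest is not None:
            return [(sign, x, y)] + rest
    return None


proof = explain("INCREASING", "deceleration(m)")
for step in proof:
    print(step)
```

For the subgoal (INCREASING deceleration(m)) this search yields the same three steps as the proof shown above; the final step names the behavior that must then be verified against the execution trace.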

The  selection  of  times  for  checking  both 
subgoal  satisfaction  and  triggering  behaviors  is 
currently done by the user. These are important
parameters, yet they are difficult to choose. We next
describe  our  plans  for  future  work.  These  plans 
include  automating  the  choice  of  these  parameters,  as 
well  as  other  parts  of  the  system. 


4  Future  Work 


There are a few important directions that we
plan to pursue. The first direction consists of ordering
explanations according to their degree of plausibility.
The second direction consists of using the explanations
to generate new decision rules for Samuel.
Third, we plan to automate the generation of system
parameters and rules. The fourth future direction
consists of diagnosing failures. Finally, we would
like to increase the complexity of the EM problem.


The first direction is to
determine the differences in the degree of plausibility
of various explanations. The manner in which this is
being done is by generating explanations from multiple
episode traces. From our experiences with
explanation generation, we have been observing that
some explanations/subgoals are considered plausible
more frequently than others. We plan to use this
information about the frequency to order the PSD
rules in a manner that reflects the plausibility of
explanations, e.g., more plausible subgoals are tried
first.
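
One way such a frequency-based ordering might be realized is sketched below. It is purely illustrative: the function `order_psd_rules`, the per-episode lists of confirmed subgoals, and the tie-breaking by rule name are assumptions, and the episode data are invented.

```python
# Hypothetical sketch: order PSD rules by how often their subgoals were
# found plausible (and confirmed) across multiple episode traces.
from collections import Counter


def order_psd_rules(confirmed_per_episode):
    """Return PSD rule names, most frequently confirmed first.

    confirmed_per_episode: one list of rule names per episode trace.
    Ties are broken alphabetically so the ordering is deterministic.
    """
    counts = Counter()
    for confirmed in confirmed_per_episode:
        counts.update(set(confirmed))   # count each rule once per episode
    return [name for name, _ in
            sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


# Invented data: which PSD rules were confirmed in each of five episodes.
episodes = [["PSD1", "PSD2"], ["PSD1"], ["PSD1", "PSD2"], ["PSD2"], ["PSD1"]]
print(order_psd_rules(episodes))  # ['PSD1', 'PSD2']
```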

The second direction for future research consists
of generating new decision rules from the explanations.
If a subgoal is satisfied, and an explanation
is generated for subgoal satisfaction, then the system
can generalize the explanation (perhaps using the
explanation-based learning methods of (Mitchell,
Keller & Kedar-Cabelli, 1986)) and then use the generalized
explanation to generate new decision rules.
Given a successful explanation, Samuel’s performance
can benefit by the creation of new decision
rules that are expected to achieve the same results as
the rules from which the explanation is formed. The
process of generating decision rules from generalized
explanations is one of rule specialization. We are
currently considering using ideas from MARVIN
(Sammut and Banerji, 1986) for designing the rule
specialization process. Once new decision rules
have been created, they can be fed back into
Samuel’s performance module to augment the existing
rule sets. These modified rule sets may then be
empirically evaluated using the EM simulator.

The third direction planned for our research is
the automation of certain portions of the system that
are currently provided by the user. For example, system
parameters, such as the user-input window size
for subgoal confirmation, might be empirically determined.
Furthermore, the domain theory might also
be derived empirically. For instance, the Q+/Q- relationships
in the domain theory for explanations could
be extracted from the execution traces.

Although we have been able to generate
explanations for successful subgoal satisfaction, a
ripe area for future research is the addition of the ability
to handle failures. If the system derives an explanation
that the reactive rules are intended to achieve a
particular subgoal, but the trace does not verify that
the subgoal has been satisfied, then there exist four
possible cases:

(1) The chosen explanation is incorrect, but the
domain theory is not faulty

(2) The plausible subgoal that is inferred is not actually
the subgoal that the system is trying to achieve

(3) The reactive rules are intended to achieve a
subgoal, but the system has encountered some unexpected
interference

(4) The domain theory is incorrect or incomplete

Although the generation of alternative explanations
would be a relatively simple solution for the first
case, the other cases would require more sophisticated
error diagnosis.

A  final  direction  for  future  research  is  to 
increase  the  complexity  of  the  EM  problem.  For 
example,  the  only  controllable  parameter  currently 
implemented is the plane turning rate. More controllable
parameters might be added. Furthermore, the
problem  difficulty  would  be  greatly  increased  if  the 
number  of  missiles  were  increased.  Ultimately,  we 
would  like  Samuel  to  be  able  to  handle  realistic 
problems. 

5  Summary 

Progress  in  generating  and  using  explanations 
of  reactive  plans  for  Samuel  is  expected  to  provide 
an  important  step  toward  reducing  the  burden  placed 
on  the  system’s  empirical  learning  mechanisms.  The 
eventual goal of our research is to use these explanations
to create high performance reactive plans.

Abstract

One of the basic assumptions of work in machine
learning is that the environments in which our learning
systems are situated are stable enough to make
learning useful. While this assumption is warranted
in most domains, much of what we tend to think of
as a natural regularity in the world has often been artificially
imposed in order to make both learning and
planning more tractable. This imposition is usually
the product of a long-term manipulation of the physical
structure, goals, and operators of these domains
in the direction of maximal utility. Often, however,
it is the product of an agent imposing stability on a
domain in an effort to increase the utility of his own
planning and learning. This paper examines the idea
of what it would mean for an agent to strategically
impose order on a domain in an effort to increase the
effectiveness of its own learning. In particular, it outlines
an initial taxonomy of classes of stability and
presents the strategies for increasing overall stability
that are associated with each class. Finally, it outlines
the basic learning and planning trade-offs that have to
be made when stability is optimized.

1  Stability,  change,  and 
enforcement 

The world is in flux. Every moment brings a new set
of states into being and removes an old set from existence.
Every action, every plan we know, introduces
change in the form of the goals we are trying to achieve
and the side-effects of the actions that we are using to
achieve them.

The world is also stable. Facts persist over time.
Objects tend to stay where they are placed, actions
tend to have the same results, and the basic physics of
our environment seems to remain fixed and unchanging.
Even our goals tend to stay constant over time.

The  possibility  of  change  allows  us  to  act  at  all. 
The  stability  of  the  world,  however,  allows  us  to  act 
intelligently.  It  is  our  trust  in  the  stability  of  the  world 
that  allows  us  to  predict  the  future  based  on  the  past 
and  build  plans  based  on  experience. 

There is a direct relationship between the overall
stability of an environment and our ability to predict
and plan within it. The greater the stability the more
certain our predictions, the more powerful our plans.

As  both  individuals  and  as  societies,  we  respond  to 
this  by  trying  to  increase  the  stability  of  our  world. 
We  segment  our  schedules  of  work,  play  and  relaxation 
so  that  each  day  will  tend  to  look  very  much  like  the 
last.  We  organize  our  homes  and  workspaces  so  that 
objects  will  be  in  predictable  places.  We  even  organize 
our  habits  so  that  particular  conjuncts  of  goals  will 
tend  to  arise  together.  In  all  aspects  of  our  lives,  we 
make  moves  to  stabilize  our  different  worlds. 

Of course, all this is done for a reason. Schedules
that remain constant over time improve predictability
and provide fixed points that reduce the complexity
of projection. Few of us need to reason hard about
where we will be from 9 to 5 because we have stabilized
our schedules with respect to those hours. Fixed
locations for objects reduce the need for inference and
enable the execution of plans tuned to particular environments.
If your drinking glasses are all kept in
one cupboard, you can get a drink of water without
ever considering the real precondition to the plan that
they are in there now. Likewise, the clustering of goals
into standard conjuncts enables the automatic use of
plans that are optimized for those conjuncts. A morning
routine is exactly that, a routine that is designed
to fit a conjunct of goals that can all be satisfied with
a well tuned plan. In general, we force stability on the
world, and then enforce it, in an effort to improve our
ability to function in it.

In  this  paper,  we  outline  this  concept  of  enforce¬ 
ment  and  discuss  the  different  forms  that  it  takes.  We 
examine  the  idea  of  what  it  would  mean  for  an  agent  to 
strategically  impose  order  on  a  domain  in  an  effort  to 
increase the effectiveness of its own learning. In particular,
we outline a basic taxonomy of classes of stability
and present the strategies for increasing overall
stability that are associated with each class. We examine
the relationship of enforcement to learning and argue that both
learning  and  enforcement  are  strategies  for  building  up 
a  correspondence  between  an  agent’s  mental  model  of 
the  world  and  the  actual  physical  reality.  We  also  dis¬ 
cuss  the  learning  and  planning  trade-offs  that  have  to 
be made when stability is optimized.

2  Planning, learning and enforcement

Over the past five years, research in planning has
changed course dramatically. Planning researchers
have begun to acknowledge that the world
is  far  too  complex  and  uncertain  to  allow  a  planner  to 
exhaustively  plan  for  a  set  of  goals  prior  to  execution 
(Chapman,  1985).  More  and  more,  the  study  of  plan¬ 
ning  is  being  cast  as  the  broader  study  of  planning, 
action,  learning,  and  understanding  (Agre  and  Chap¬ 
man,  1987,  Alterman,  1985,  and  Hammond,  1989). 

The  particular  cast  of  this  relationship  that  we 
have  been  studying  is  a  view  of  planning  as  embedded 
within  a  memory-based  understanding  system  (Mar¬ 
tin,  1990)  connected  to  the  environment  (Hammond 
and  Converse,  1990).  The  power  of  this  approach 
lies  in  the  fact  that  it  allows  us  to  view  the  planner’s 
environment  as  well  as  its  plan  selections,  decisions, 
conflict  resolutions,  and  action  mediation  through  the 
single  eye  of  situation  assessment  and  response.  We 
see  this  integration  of  planning,  understanding,  and 
action  as  a  model  of  agency,  in  that  we  are  attempt¬ 
ing  to  capture  an  architecture  for  an  agent  embedded 
in  an  environment  rather  than  simply  a  planner  ab¬ 
stracted  away  from  an  external  world. 

This  integration  of  planning  with  understanding,  in 
particular  understanding  through  the  use  of  episodic 
memory,  also  provides  us  with  a  powerful  tool  with 
which  to  deal  with  the  problem  of  learning  from  both 
planning  and  execution.  One  aspect  of  this  view  of 
agency  is  that  it  treats  planning  as  a  long-term  prob¬ 
lem  that  continues  over  time.  Rather  than  seeing  the 
problems  of  planning  and  action  in  terms  of  single  in¬ 
stances  of  goals  and  their  related  plans,  we  see  plan¬ 
ning  as  also  involving  the  ongoing  process  of  finding 
the  set  of  plans  and  plan  modifications  that  are  most 
useful  within  the  planner’s  domain.  For  example,  the 
overall  cost  of  constructing  a  plan  can  be  thought  of 
as  amortized  over  its  repeated  reuse,  but  only  if  we 
think  of  planning  as  the  creation  of  structures  that 
actually  will  be  saved  and  reused  (Marks,  Hammond, 
and  Converse  1989). 

Part  of  this  process  involves  the  standard  issues  of 
learning.  This  includes  learning  particular  plans,  the 
features  that  predict  their  usefulness,  and  the  con¬ 
ditions  under  which  they  should  be  avoided.  Here, 
the  overall  goal  is  to  develop  an  internal  model  of  the 
plans  and  inferences  that  are  functional  in  the  actual 
world  by  adapting  the  internal  world  to  match  the  ex¬ 
ternal  reality.  Much  of  our  work  to  date  (Hammond, 
Converse,  and  Marks,  1988  and  Hammond,  1989)  has 
been  aimed  at  this  sort  of  learning  in  the  context  of 
planning  and  execution.  In  particular,  we  have  been 
concerned  with  learning  optimized  plans  for  the  recur¬ 
ring  conjuncts  of  goals  in  a  domain  as  well  as  those 
that  avoid  the  typical  problems  that  will  tend  to  arise 
out  of  a  problem  space. 

The idea of enforcement is a somewhat different
approach to the goal of building this functional correspondence
between an agent’s internal state and the
external world. The difference between enforcement
and previous approaches to learning from planning lies
in its use of techniques to shape and stabilize an environment
in an effort to optimize the overall utility of
plans  that  already  exist  or  that  have  just  recently  been 
produced.  The  goal  associated  with  these  techniques 
is  the  same  as  that  associated  with  learning  in  the  con¬ 
text  of  planning — the  development  of  a  set  of  effective 
plans  that  can  be  applied  to  satisfy  the  agent’s  goals. 
The  path  toward  this  goal,  however,  is  one  of  shaping 
the  world  to  fit  the  agent’s  plans  rather  than  shaping 
the  agent  to  fit  the  world.  The  idea  of  enforcement, 
then, arises out of the observation that the result of a
long-term  interaction  between  agent  and  environment 
includes  an  adaptation  of  the  environment  as  well  as 
an  adaptation  of  the  agent. 

3  Opportunism and enforcement: An example

Our notion of enforcement arises out of our approach
to planning as only part of the study of agency and, most
specifically, out of our examination of opportunistic
memory in the TRUCKER and RUNNER projects
(Hammond 1989). In this work, we looked at the issues
involved with indexing blocked goals in memory
and reawakening them under conditions which would
favor their satisfaction. The idea was to combine
planning-time reasoning with execution-time understanding
in an effort to obtain efficient recognition of,
and capitalization on, execution-time opportunities.

One  of  the  examples  that  we  examined  involves  an 
agent  going  to  the  grocery  store  to  pick  up  a  quart  of 
orange  juice  and  recalling  that  he  needs  milk  as  well. 
We argue that there are two aspects to how an agent
should respond to this sort of problem. First, he should
attempt to incorporate the plans for the recalled goal
into the current execution agenda. Second, he should
reason about the likelihood of the recalled goal recurring
in conjunction with the goal that was already
being acted upon and save the plan for the conjunct
of the two goals if they are likely to be conjoined in
the future. In a sense, these two steps correspond to
fixing the plan and then fixing the planner.

One element of this process that interests us is the
notion that the more likely it is that the goals will
show up in conjunction with each other, the more useful
the plan will be. In this example, the utility of
saving  and  attempting  to  reuse  the  plan  to  buy  both 
the  orange  juice  and  the  milk  is  maximized  when  the 
two  goals  are  guaranteed  to  show  up  in  conjunction 
whenever  either  of  the  two  recurs.  This  suggests  the 
idea  that  one  of  the  steps  that  an  agent  could  take  in 
improving  the  utility  of  his  plans  would  be  to  force  the 
recurrence  of  the  conjuncts  of  goals  over  which  these 
plans  are  optimized.  In  terms  of  the  orange  juice  and 
milk example, this means making sure that the cycles
of use of each resource are synchronized. This can be
done by either changing the actual use of the resources
to bring them into synchronization or by changing the
amounts purchased such that they would be used up
at the same time. In either case, the idea is to alter
circumstances in the world such that the long term
utility of a plan that already exists is optimized. This
is done by stabilizing the world with regard to the relative
use of the two resources. This type of enforcement
is aimed at controlling what we call RESOURCE CYCLE
SYNCHRONIZATION in that its goal is to stabilize the
use cycles of multiple resources with respect to one
another.
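
As a rough illustration of the second option (changing the amounts purchased), the sketch below computes purchase quantities so that several resources run out on the same day. The consumption rates, the units, and the function name are illustrative assumptions; none of them come from TRUCKER or RUNNER.

```python
from fractions import Fraction


def synchronized_amounts(daily_use, cycle_days):
    """Return how much of each resource to buy so that all of them are
    used up after `cycle_days` days (hypothetical helper).

    daily_use: dict mapping resource name -> quantity consumed per day.
    """
    return {name: Fraction(rate) * cycle_days for name, rate in daily_use.items()}


# Assumed consumption rates, in quarts per day (illustrative only).
daily_use = {"orange-juice": Fraction(1, 4), "milk": Fraction(1, 2)}

amounts = synchronized_amounts(daily_use, cycle_days=4)

# With these amounts, both resources are exhausted on the same day.
days_until_empty = {name: amounts[name] / daily_use[name] for name in daily_use}
```

Under these assumed rates, buying in proportion to consumption is enough to make both goals recur together; changing the rates of use instead would be the other route described above.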

Adjusting the amount of orange juice purchased
makes its cycle of use match the cycle of use of the milk.
This  increases  the  utility  of  the  plan  to  buy  the  two 
together  in  three  ways:  optimization  of  planning,  op¬ 
timization  of  indexing,  and  optimization  of  execution. 

•  In  terms  of  planning  optimization,  the  agent  now 
has  available  a  plan  for  a  conjunct  of  goals  that 
he  knows  will  recur  so  he  never  needs  to  recreate 
it. 

This  means  never  having  to  reconstruct  the  GET- 
ORANGE-JUICE-AND-MILK  plan  again. 

•  In terms of indexing optimization, the plan
can be indexed by each of the elements of the
conjunct, rather than by the conjunct itself,
thus reducing the complexity of the search for
the plan in the presence of the individual goals.
This  means  that  the  plan  will  be  automatically 
suggested  when  either  the  HAVE-MILK  goal  or  the 
HAVE-ORANGE-JUICE  goal  arises  even  when  the 
other  element  of  the  goal  conjunct  does  not. 

•  In  terms  of  execution  optimization,  the  agent  can 
decide  to  commit  to  and  begin  execution  of  the 
new  plan  when  either  of  the  two  goals  arises. 
It  can  do  this  because  it  is  able  to  predict  that 
the  other  goal  is  also  present,  even  if  it  is  not 
explicitly  so. 

This  means  that  the  agent  can  begin  to  run  the 
GET-ORANGE-JUICE-AND-MILK  plan  when  he  no¬ 
tices  that  he  is  out  of  either  milk  or  orange  juice 
without  being  forced  to  verify  that  the  other  goal 
is  active.  In  some  sense,  the  agent  does  not  have 
to  check  the  refrigerator  to  see  if  he  is  out  of 
milk. 
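
The indexing and execution points above can be sketched as a small plan memory. This is a minimal illustration under our own assumptions: the class `PlanLibrary` and its methods are hypothetical names, and the real systems use a richer memory organization than a dictionary.

```python
class PlanLibrary:
    """Toy plan memory (hypothetical): maps a single goal to a stored plan."""

    def __init__(self):
        self._by_goal = {}

    def store_conjunct_plan(self, plan_name, goals):
        # Index the plan under every element of the conjunct, not under
        # the conjunct as a whole, so one active goal is enough to find it.
        for goal in goals:
            self._by_goal[goal] = (plan_name, frozenset(goals))

    def suggest(self, active_goal):
        """Return (plan, goals assumed active) for one observed goal, or None."""
        return self._by_goal.get(active_goal)


library = PlanLibrary()
library.store_conjunct_plan(
    "GET-ORANGE-JUICE-AND-MILK", ["HAVE-MILK", "HAVE-ORANGE-JUICE"]
)

# Noticing that the milk is gone retrieves the joint plan; the orange juice
# goal is predicted rather than checked.
plan, assumed_goals = library.suggest("HAVE-MILK")
```

The lookup is only safe because the use cycles have been synchronized; without that enforcement, the predicted goal might not actually be active.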

One  way  of  viewing  enforcement  is  as  an  extension 
of  planning  itself.  As  in  planning,  the  conditions  that 
are  enforced  are  fixed  in  the  world  using  the  same 
sorts  of  actions  that  result  in  the  satisfaction  of  goals. 
The  difference  is  that  the  actions  associated  with  en¬ 
forcement  result  in  changes  to  the  actual  structure  of 
a  domain. 

Likewise,  enforcement  can  be  seen  as  an  active 
cousin  of  learning.  Just  as  learning  techniques  in  plan¬ 
ning  are  designed  to  build  up  an  effective  set  of  plans 
and operators for a domain, enforcement techniques
are designed to do so as well. The difference here is
that  learning  attempts  to  satisfy  this  goal  by  chang¬ 
ing  the  learner  and  enforcement  attempts  to  do  so  by 
changing  the  world. 

From  either  point  of  view,  the  notion  of  enforce¬ 
ment  is  straightforward.  For  any  plan  that  is  learned, 
its  utility  can  be  maximized  if  the  conditions  in  the 
world  that  favor  its  use  can  be  guaranteed.  If  the 
world  is  unstable  with  respect  to  those  conditions,  one 
step  that  an  agent  can  take  to  optimize  the  utility  of 
the  plan  is  to  enforce  that  stability  by  changing  some 
aspect  of  the  world  so  as  to  make  those  conditions  pre¬ 
vail  in  all  circumstances  under  which  the  plan  could 
be  run. 

4  Stability  and  enforcement 

While RESOURCE CYCLE SYNCHRONIZATION was one
of  the  first  instances  of  stability  we  encountered,  it  is 
by  no  means  the  only  kind.  In  our  preliminary  ex¬ 
amination,  we  have  uncovered  six  other  basic  types 
of  stability  and  related  enforcement  strategies.  Each 
type  of  stability,  when  enforced,  increases  the  utility 
of  existing  plans  and  planning  processes  with  respect 
to  the  cost  of  use,  the  cost  of  indexing,  the  cost  of 
projection  and/or  the  likely  applicability  of  the  plans 
that  have  been  stored. 

The  question  is,  is  it  possible  to  explicate  this  tax¬ 
onomy  of  stability  in  a  way  that  would  allow  a  system 
to  actually  recognize  and  enforce  the  different  types? 
The sections that follow outline this taxonomy with
respect  to  this  question  by  breaking  each  type  down 
in  terms  of  the  following  issues: 

•  What  types  of  stability  are  useful  in  and 
of  themselves? 

•  Over  what  goals  do  they  allow  optimiza¬ 
tion? 

•  What  strategies  can  be  formed  to  enforce 
them? 

•  How  can  opportunities  to  apply  these  en¬ 
forcement  strategies  be  recognized? 

4.1  Stability  of  location 

The  most  common  type  of  stability  that  arises  in  ev¬ 
eryday  activity  is  that  of  location  of  commonly  used 
objects.  Our  drinking  glasses  end  up  in  the  same  place 
every  time  we  do  dishes.  Our  socks  are  always  to¬ 
gether  in  a  single  drawer.  Everything  has  a  place  and 
we  enforce  everything  ending  up  in  its  place. 

In  the  RUNNER  project,  we  have  already  begun 
to  see  the  utility  of  this  sort  of  stability  in  terms  of 
optimizing  the  reuse  of  specific  plans.  RUNNER  is 
functioning  in  a  breakfast  world  in  which  it  has  to 
make  a  pot  of  coffee  in  the  morning.  Stabilizing  the 
location of objects such as the coffee pot, the beans,
and the grinder would allow it to simply reuse existing
plans with minimal modification. It also reduces the
need for search for the objects in both the knowledge-base
and physical senses of the word. Of course, the
way to enforce this sort of stability is to alter plans
that make use of these objects so that they end up
placing them back where they “belong.”

Enforcing  STABILITY  OF  LOCATION,  then,  serves  to 
optimize  a  wide  range  of  processing  goals.  First  of 
all,  the  fact  that  an  often  used  object  or  tool  is  in 
a  set  location  reduces  the  need  for  any  inference  or 
projection  concerning  the  effects  of  standard  plans  on 
the  objects  or  the  current  locations  of  objects.  Second, 
it allows plans that rely on the objects’ locations to be
run without explicit checks (e.g., no need to explicitly
determine that the glasses are in the cupboard before
opening it). Third, it removes the need at execution
time for a literal search for the object.

As we mentioned earlier, the enforcement of this
sort of stability requires altering plans that make use
of these objects so that they end up placing them back
where they “belong.”

The  final  question  in  terms  of  stability  of  loca¬ 
tion,  then,  is  the  issue  of  when  to  attempt  enforce¬ 
ment.  As  in  many  instances  of  standard  learning,  fail¬ 
ure  is  a  good  indicator.  Here,  the  problem  will  take 
the  form  of  an  execution-time  failure  to  actually  find 
an object that is both known to exist and
essential to a plan being run. Of course, if many plans
make use of the object and each prefers it in a different
location, then it will not be useful to attempt to
enforce its location.
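
A minimal sketch of this failure-driven strategy follows, assuming plans are simple lists of (action, object) steps. The representation, the function `enforce_location`, and the GRIND-SPICES plan are our own illustrative inventions, not part of RUNNER.

```python
def enforce_location(plans, obj, preferred):
    """Append a put-back step to every plan that uses `obj` (hypothetical).

    plans: dict mapping plan name -> list of (action, object) steps.
    preferred: dict mapping plan name -> location where that plan wants `obj`.
    Returns the enforced home location, or None if the plans disagree.
    """
    users = [name for name, steps in plans.items()
             if any(target == obj for _, target in steps)]
    locations = {preferred[name] for name in users}
    if len(locations) != 1:
        # Plans want the object in different places: enforcement is not useful.
        return None
    home = locations.pop()
    for name in users:
        plans[name].append(("put-back", obj))
    return home


plans = {
    "MAKE-COFFEE": [("fetch", "grinder"), ("grind", "beans"), ("brew", "coffee-pot")],
    "GRIND-SPICES": [("fetch", "grinder"), ("grind", "spices")],
}
preferred = {"MAKE-COFFEE": "counter", "GRIND-SPICES": "counter"}

# Called after an execution-time failure to find the grinder.
home = enforce_location(plans, "grinder", preferred)
```

The early return captures the caveat above: when the plans that use an object prefer it in different places, no single location is worth enforcing.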

4.2  Stability  of  schedule 

Another  common  form  of  stability  involves  the  con¬ 
struction  of  standard  schedules  that  persist  over  time. 
Eating  dinner  at  the  same  time  every  day  or  having 
preset  meetings  that  remain  stable  over  time  are  two 
examples  of  this  sort  of  stability.  The  main  advantage 
of  this  sort  of  stability  is  that  it  allows  for  very  ef¬ 
fective  projection  in  that  it  provides  fixed  points  that 
do  not  have  to  be  reasoned  about.  In  effect,  the  fixed 
nature  of  certain  parts  of  an  overall  schedule  reduces 
the size of the problem space that has to be searched.

A second advantage is that fixed schedules actually
allow  greater  optimization  of  the  plans  that  are  run 
within  the  confines  of  the  stable  parts  of  the  sched¬ 
ule.  Features  of  a  plan  that  are  linked  to  time  can  be 
removed  from  consideration  if  the  plan  is  itself  fixed 
in time. For example, by going into work each day at
8:30, an agent might be able to make use of the traffic
report that is on the radio at the half-hour. Because
the schedule is stable, however, he doesn’t have to actually
reason about this as an explicit condition of the
plan.

Finally, if the schedule is stabilized with regard to
a pre-existing norm (e.g., always have lunch at noon),
coordination between agents is also facilitated.

The enforcement strategy associated with STABILITY
OF SCHEDULE is simple: don’t break the schedule.
Here, of course, we see the first instance of a trade-off
between enforcement and planning flexibility. While
an  enforced  schedule  allows  for  optimization  of  search 
and  execution  for  recurring  goals,  it  often  reduces  the 
flexibility  required  to  incorporate  new  goals  into  the 
preset  agenda.  As  with  any  heuristic  that  reduces  the 
combinatorics  of  a  search  space,  there  will  be  times 
when  an  optimal  plan  is  not  considered. 

It  is  important  to  realize  that  the  schedule  enforced 
is  optimized  over  the  goals  that  actually  do  tend  to 
recur.  Thus,  an  agent  who  is  enforcing  this  sort  of 
stability  is  able  to  deal  with  regularly  occurring  events 
with  far  greater  ease  than  when  it  is  forced  to  deal 
with  goals  and  plans  outside  of  its  normal  agenda. 
This  sort  of  trade-off  in  which  commonly  occurring 
problems  are  easier  to  solve  than  less  common  ones 
seems  to  be  an  essential  by-product  of  stabilizing  an 
environment. 

Recognition  of  opportunities  to  enforce  STABILITY 
OF  SCHEDULE  is  a  fairly  difficult  problem.  There  are 
two  basic  features  that  are  important.  First,  an  agent 
must  recognize  that  a  goal  is  going  to  recur  and  that 
a  single  plan  is  designed  to  satisfy  it.  Second,  he  must 
recognize  that  a  particular  placement  in  time  is  op¬ 
timal  for  running  the  plan.  While  the  first  of  these 
is fairly simple, the second feature requires either an
extensive  projection  over  different  times  for  running 
the  plan  or  an  opportunistic  realization  that  the  plan 
has  been  run  at  a  particularly  good  time  at  one  point. 

4.3  Stability  of  satisfaction 

Another  type  of  stability  that  an  agent  can  enforce 
is  that  of  the  goals  that  he  tends  to  satisfy  in  con¬ 
junction  with  each  other.  For  example,  people  living 
in  apartment  buildings  tend  to  check  their  mail  on 
the  way  into  their  apartments.  Likewise,  many  peo¬ 
ple  will  stop  at  a  grocery  store  on  the  way  home  from 
work.  In  general,  people  develop  habits  that  cluster 
goals  together  into  compact  plans,  even  if  the  goals  are 
themselves  unrelated.  The  reason  that  the  plans  are 
together  is  more  a  product  of  the  conditions  associated 
with  running  the  plans  than  the  goals  themselves. 

An  important  feature  of  this  sort  of  stability  is  that 
the  goals  are  recurring  and  that  the  plan  associated 
with  the  conjunct  is  optimized  with  respect  to  them. 
Further,  the  goals  themselves  must  be  on  loose  cycles 
and  robust  with  regard  to  over-satisfaction. 

The first advantage of this sort of stability of satisfaction
is that an optimal plan can be used that is
already tuned for the interactions between individual
plan steps. Second, it can be run habitually, without
regard to the actual presence of the goals themselves.
As in the case of STABILITY OF LOCATION in
which a plan can be run without explicit checks on
the locations of objects, STABILITY OF SATISFACTION
allows for the execution of plans aimed at satisfying
particular goals, even when the goals are not explicitly
checked.

The way to enforce this sort of stability is to associate
the plan with a single cue (either a goal or
a feature in the world) and begin execution of that
plan whenever the cue arises. In this way, the habitual
activity can be started even when not all of the goals
that it satisfies are present.

The  circumstances  that  suggest  habits  that  will  en¬ 
force  STABILITY  OF  SATISFACTION  are  simple.  When 
an opportunity to satisfy a suspended goal (Hammond,
1989) is encountered, the goals
currently being satisfied become candidates for conjunction
into a habit. The individual cycles of the
goals  then  determine  whether  or  not  they  are  joined 
into  a  single  plan. 

4.4  Stability  of  plan  use 

We  often  find  ourselves  using  familiar  plans  to  satisfy 
goals  even  in  the  face  of  wide  ranging  possibilities. 
For  example,  when  I  travel  for  conferences,  I  tend  to 
schedule  my  flight  in  to  a  place  as  late  as  I  can  and 
plan  to  leave  as  late  as  I  can  on  the  last  day.  This  opti¬ 
mizes  my  time  at  home  and  at  the  conference.  It  also 
allows  me  to  plan  without  knowing  anything  about 
the  details  of  the  conference  schedule.  As  a  result,  I 
have  a  standard  plan  that  I  can  run  in  a  wide  range  of 
situations  without  actually  planning  for  them  in  any 
detail.  It  works,  because  it  already  deals  with  the  ma¬ 
jor  problems  (missing  classes  at  home  and  important 
talks  at  the  conference)  as  part  of  its  structure. 

The major advantage here in enforcing the STABILITY
OF PLAN USE is that the plan that is used is tuned
to avoid the typical interactions that tend to come up.
This means, of course, that the plans used in this way
must either be the result of deep projection over the
possible problems that can come up in a domain or
be constructed incrementally. A further advantage is
that  little  search  through  the  space  of  possible  plans 
for  a  set  of  goals  needs  to  be  done  in  that  one  plan  is 
always  selected. 

The  enforcement  here  is  simply  a  product  of  choos¬ 
ing  to  stay  with  a  single  plan  in  the  face  of  a  large 
space  of  possibilities.  In  a  sense,  this  is  the  idea  of 
having  a  standard  operating  procedure  for  a  set  of 
goals. 

This  sort  of  stabilization  is  the  most  emergent  of 
the  different  types  of  stability  and  enforcement  that 
we  have  looked  at.  This  is  because  these  sorts  of  plans 
are  the  product  of  incremental  debugging  over  time. 
The  drive  towards  selection  of  plans  that  have  worked 
in  the  past  is  simply  part  of  the  overall  case-based 
approach. Of course, by starting each new experience
with a set of goals from a plan that is debugged with
respect to a set of already experienced problems, and then
debugging it with respect to new problems, an agent is
actually constructing the plan that will be optimized
for the entire set of interactions that he will encounter.

4.5  Policy 

Everyone  always  carries  money.  This  is  because  we 
always need it for a wide variety of specific plans. Of
course, the policy decision to always have money on
hand  is  itself  a  form  of  enforcement. 

The  interesting  issue  here  is  that  there  are  partic¬ 
ular  states  in  the  world  that  act  as  preconditions  to 
a  wide  variety  of  plans.  By  enforcing  these  states,  we 
can  establish  a  POLICY  that  will  allow  us  to  run  these 
plans  without  explicit  regard  to  the  state  itself.  Here 
again, the advantage is that plans can be run without
explicit  reference  to  many  of  the  conditions  that  must 
obtain  for  them  to  be  successful.  An  agent  can  actu¬ 
ally  assume  conditions  hold,  because  he  has  a  POLICY 
that  makes  them  hold. 

Enforcement  of  policy  requires  the  generation  of 
specific  goals  to  satisfy  the  policy  state  whenever  it  is 
violated.  In  terms  of  policies  such  as  always  having 
money  on  hand,  this  means  that  the  lack  of  cash  on 
hand  will  force  the  generation  of  a  goal  to  have  cash, 
even  when  no  specific  plan  that  will  use  that  cash  is 
present. 
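
The goal generation described here can be sketched as a check of each policy state against the current world. The function name, the dictionary representation of the world, and the cash threshold of 20 are assumptions made for illustration only.

```python
def policy_goals(policies, world):
    """Return a goal for each violated policy state (hypothetical sketch).

    policies: dict mapping goal name -> predicate over the world state that
              is True while the policy state holds.
    """
    return [goal for goal, holds in policies.items() if not holds(world)]


# Assumed policy: keep at least 20 units of cash on hand (threshold is illustrative).
policies = {"HAVE-CASH": lambda world: world["cash"] >= 20}

# No plan that needs cash is active, yet the goal is still generated.
goals_when_short = policy_goals(policies, {"cash": 5, "active_plans": []})
goals_when_fine = policy_goals(policies, {"cash": 40, "active_plans": []})
```

Note that the check never consults the active plans: the policy state is maintained for its own sake, which is what lets later plans assume it.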

The  conditions  that  suggest  the  enforcement  of 
policies  include  noting  that  many  plans  make  use  of 
the state associated with the policy. This can be recognized
through a process of failure-driven learning.
In effect, the failure to have cash on hand suggests
that it is a good policy to enforce. There are
other conditions that must hold as well, however. The
POLICY state must be relatively inexpensive to maintain
and the state should be a useful precondition for
a wide range of plans.


4.6  Stability  of  cues 


One  effective  technique  for  improving  plan  perfor¬ 
mance  is  to  improve  the  proper  activation  of  a  plan 
rather than improve the plan itself. For example, an agent
who places an important paper that needs to be reviewed on
his desk before going home improves the likelihood
that he will see and read it the next day. Marking
calendars and leaving notes serve the same sort
of purpose.

One  important  area  of  enforcement  is  related  to 
this use of visible cues in the environment to activate
goals  that  have  been  suspended  in  memory.  The  idea 
driving  this  type  of  enforcement  is  that  an  agent  can 
decide  on  a  particular  cue  that  will  be  established  and 
maintained  so  as  to  force  the  recall  of  commonly  recur¬ 
ring  goals.  One  example  of  this  kind  of  enforcement 
of  STABILITY  OF  CUES  is  leaving  a  briefcase  by  the 
door  every  night  in  order  to  remember  to  bring  it  into 
work. The cue itself remains constant over time. This
means that the agent never has to make an effort to
recall the goal at execution time and, because the cue
is stabilized, it also never has to reason about what
cue to use when the goal is initially suspended.

The  advantage  of  this  sort  of  enforcement  is  that 
an  agent  can  depend  on  the  external  world  to  provide 
a  stable  cue  to  remind  it  of  goals  that  still  have  to 
be  achieved.  This  sort  of  stability  is  suggested  when 
an agent is faced with repeated failures to recall a
goal and the plan associated with the goal is tied to
particular objects or tools in the world.

5  A  Model  of  Agency 

Of course, this work on enforcement is only part of an
overall effort in the study of agency: the modeling of a
system that is able to understand, plan and perform
action in the world. We use the term agency rather
than planning because we are attempting to model
the broader class of tasks including the spawning of
goals, selection of plans, and execution of actions. Our
process  model  of  agency  is  based  on  Martin’s  DMAP 
understander as well as its antecedent, Schank’s Dynamic
Memory (1982). DMAP uses a memory organization
defined by part/whole and abstraction relationships.
Activations from environmentally supplied
features  are  passed  up  through  abstraction  links  and 
predictions  are  passed  down  through  the  parts  of  par¬ 
tially  active  concepts.  Subject  to  some  constraints, 
when  a  concept  has  only  some  of  its  parts  active,  it 
sends  predictions  down  its  other  parts.  When  acti¬ 
vations  meet  existing  predictions,  the  node  on  which 
they  meet  becomes  active.  Finally,  when  all  of  the 
parts  of  a  concept  are  activated,  the  concept  itself  is 
activated. 

To  accommodate  action,  we  have  added  the  notion 
of  PERMISSIONS.  Permissions  are  handed  down  the 
parts  of  plans  to  their  actions.  The  only  actions  that 
can  be  executed  are  those  that  are  permitted  by 
the  activation  of  existing  plans.  Following  McDer¬ 
mott  (McDermott,  1978),  we  have  also  added  POLI¬ 
CIES.  Policies  are  statements  of  ongoing  goals  of  the 
agent.  Sometimes  these  take  the  form  of  maintenance 
goals,  such  as  “Glasses  should  be  in  the  cupboard.”  or 
“Always  have  money  on  hand.”  The  only  goals  that 
are  actively  pursued  are  those  generated  out  of  the 
interaction  between  policies  and  environmental  fea¬ 
tures.  We  would  argue  that  this  is,  in  fact,  the  only 
way  in  which  goals  can  be  generated. 

Most  of  the  processing  takes  the  form  of  recogniz¬ 
ing  circumstances  in  the  external  world  as  well  as  the 
policies,  goals  and  plans  of  the  agent.  The  recognition 
is  then  translated  into  action  through  the  mediation 
of  PERMISSIONS  that  are  passed  to  physical  as  well  as 
mental  actions. 

Goals,  plans,  and  actions  interact  as  follows: 

•  Features  in  the  environment  interact  with  POLI¬ 
CIES  to  spawn  goals. 

For example, in RUNNER, the specific goal to
HAVE-COFFEE is generated when the system recognizes
that it is morning. The goal itself arises
out  of  the  recognition  of  this  state  of  affairs  in 
combination  with  the  fact  that  there  is  a  policy 
in  place  to  have  coffee  at  certain  times  of  the  day. 

•  Goals  and  environmental  features  combine  to  ac¬ 
tivate  plans  already  in  memory. 


Any new MAKE-COFFEE plan is simply the activation
of the sequence of actions associated with
the existing MAKE-COFFEE plan in memory. It
is  recalled  by  RUNNER  when  the  HAVE-COFFEE 
goal  is  active  and  the  system  recognizes  that  it 
is  at  home. 

•  Actions  are  permitted  by  plans  and  are  asso¬ 
ciated  with  the  descriptions  of  the  world  states 
appropriate  to  their  performance.  Once  a  set  of 
features  has  an  action  associated  with  it,  that 
set  of  features  (in  conjunct  rather  than  as  in¬ 
dividual  elements)  is  now  predicted  and  can  be 
recognized. 

Filling  the  coffee  pot  is  permitted  when  the 
MAKE-COFFEE  plan  is  active;  it  is  associated 
with  the  features  of  the  pot  being  in  view  and 
empty.  This  means  not  only  that  the  features 
are  now  predicted  but  also  that  their  recognition 
will  trigger  the  action. 

• Actions are specialized by features in the environment
and by internal states of the system.
As with Firby’s RAPs (Firby, 1989), particular
states of the world determine particular methods
for each general action.

For  example,  the  specifics  of  a  GRASP  would  be 
determined  by  information  taken  from  the  world 
about  the  size,  shape  and  location  of  the  object 
being  grasped. 

• Action level conflicts are recognized and mediated
using the same mechanism that recognizes
information about the current state of the world.
For example, when two actions are active (such
as filling the pot and filling the filter), a mediation
action selects one of them. During the
initial phases of learning a plan, this can in turn
be translated into a specialized recognition rule
which, in the face of a conflict, will always determine
the ordering of the specific actions.

•  Finally,  suspended  goals  are  associated  with  the 
descriptions  of  the  states  of  the  world  that  are 
amenable  to  their  satisfaction. 

For  example,  the  goal  HAVE-ORANGE-JUICE,  if 
blocked,  can  be  placed  in  memory,  associated 
with  the  conjunct  of  features  that  will  allow  its 
satisfaction,  such  as  being  at  a  store,  having 
money  and  so  forth.  Once  put  into  memory,  this 
conjunct  of  features  becomes  one  of  the  set  that 
can  now  be  recognized  by  the  agent. 
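
As a rough illustration of the last point, a suspended goal can be indexed by the conjunct of features that would allow its satisfaction, so that recognizing the whole conjunct (and not just one element of it) brings the goal back. The sketch below is not RUNNER's marker-passing implementation; the class `GoalMemory`, its method names, and the feature labels are hypothetical and chosen only for this example.

```python
class GoalMemory:
    """Hypothetical store of suspended goals indexed by feature conjuncts."""

    def __init__(self):
        # Maps a frozenset of features (a conjunct) to the goals it would unblock.
        self._suspended = {}

    def suspend(self, goal, features):
        """Associate a blocked goal with the conjunct of features that allows it."""
        self._suspended.setdefault(frozenset(features), []).append(goal)

    def recognize(self, observed):
        """Return goals whose entire conjunct is present among the observed features."""
        observed = set(observed)
        return [goal
                for conjunct, goals in self._suspended.items()
                if conjunct <= observed
                for goal in goals]


memory = GoalMemory()
memory.suspend("HAVE-ORANGE-JUICE", {"at-store", "have-money"})

partial = memory.recognize({"at-store"})                        # conjunct incomplete
full = memory.recognize({"at-store", "have-money", "morning"})  # conjunct complete
```

Only the second call returns the suspended goal, which mirrors the requirement that the features be recognized in conjunct rather than individually.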

Eventually, RUNNER should also be able to recognize
opportunities to interleave plans and to modify
plans in response to different types of failures.

6  A  Framework  for  the  Study 
of  Agency 

We do not see this model as a solution to the problems
of planning and action. Instead, we see this as a
framework in which to discuss exactly what an agent
needs  to  know  in  a  changing  world.  Advantages  of 
this  framework  include: 

1.  A  unified  representation  of  goals,  plans,  actions 
and  conflict  resolution  strategies. 

2.  Ability  to  learn  through  specialization  of  general 
techniques. 

3. A fully declarative representation that allows for
meta-reasoning about the planner’s own knowledge
base.

4. A simple marker-passing scheme for recognition
that is domain- and task-neutral.

5.  Provision  for  the  flexible  execution  of  plans  in 
the  face  of  a  changing  environment. 

The basic metaphors of action as permission and
recognition, and planning as the construction of descriptions
that an agent must recognize prior to action,
fit our intuitions about agency. Under
this  metaphor,  we  can  view  research  into  agency  as 
the  exploration  of  the  situations  in  the  world  that  are 
valuable  for  an  agent  to  recognize  and  respond  to.  In 
particular,  we  have  examined  and  continue  to  explore 
content  theories  of: 

•  The  conflicts  between  actions  that  rise  out  of 
resource  and  time  restrictions  as  well  as  direct 
state  conflicts  and  the  strategies  for  resolving 
them. 

• The types of physical failures that block execution
and their repairs.

• The types of knowledge-state problems that
block planning and their repairs.

•  The  circumstances  that  actually  give  rise  to  goals 
in  the  presence  of  existing  policies. 

• The possible ways in which existing plans can
be merged into single sequences and the circumstances
under which they can be applied.

•  The  types  of  reasoning  errors  that  an  agent  can 
make  and  their  repairs. 

• The trade-offs that an agent has to make in dealing
with its own limits.

•  The  different  ways  in  which  a  goal  can  be  blocked 
and  the  resulting  locations  in  memory  where  it 
should  be  placed. 

Our goal is a content theory of agency. The architecture
we suggest is simply the vessel for that content.
Our notion of enforcement is simply one aspect
of the overall content model that comprises our theory
of agency.

7  The  point 

In  order  to  plan  at  all  in  an  environment,  it  must 
at  least  be  stable  with  respect  to  its  basic  physics. 
In order to reuse plans in any interesting way at all,
the environment, including the agent, must be stable
with respect to other aspects as well. In particular,
it  must  be  stable  with  regard  to  the  physical  structure 
of  the  environment,  the  goals  that  tend  to  recur  and 
the  times  at  which  events  tend  to  take  place. 

While many environments have this sort of stability,
it is often the product of the intervention of agents
attempting to stabilize it so as to increase the utility
of their own plans. In this paper, we have introduced
the idea of how an agent could take a strategic approach
to tailoring an environment to its plans and
the goals it typically must achieve. The goal of this
enforcement parallels the goal of learning: the development
of a set of effective plans that can be applied
to satisfy the agent’s goals. The path toward this goal,
however, is one of shaping the world to fit the agent’s
plans rather than shaping the agent to fit the world.
Abstract

The  problem  of  learning  decision  rules  for 
sequential  tasks  is  addressed,  focusing  on  the 
problem  of  learning  tactical  plans  from  a 
simple  flight  simulator  where  a  plane  must 
avoid  a  missile.  The  learning  method  relies 
on  the  notion  of  competition  and  employs 
genetic  algorithms  to  search  the  space  of 
decision  policies.  Experiments  are  presented 
that  address  issues  arising  from  differences 
between  the  simulation  model  on  which 
learning  occurs  and  the  target  environment 
on  which  the  decision  rules  are  ultimately 
tested.  Specifically,  either  the  model  or  the 
target  environment  may  contain  noise.  These 
experiments  examine  the  effect  of  learning 
tactical  plans  without  noise  and  then  testing 
the  plans  in  a  noisy  environment,  and  the 
effect  of  learning  plans  in  a  noisy  simulator 
and  then  testing  the  plans  in  a  noise-free 
environment. Empirical results show that,
while the best results are obtained when the
training model closely matches the target
environment, using a training environment
that is more noisy than the target
environment is better than using a
training environment that has less noise than
the target environment.

1  Introduction 

In  response  to  the  knowledge  acquisition  bottleneck 
associated  with  the  design  of  expert  systems,  research  in 
machine  learning  attempts  to  automate  the  knowledge 
acquisition  process  and  to  broaden  the  base  of  accessible 
sources  of  knowledge.  The  choice  of  an  appropriate 
learning  technique  depends  on  the  nature  of  the 
performance  task  and  the  form  of  available  knowledge. 
If the performance task is classification, and a large
number of training examples are available, then
inductive learning techniques (Michalski, 1983) can be
used to learn classification rules. If there exists an
extensive domain theory and a source of expert behavior,
then explanation-based methods may be applied
(Mitchell et al., 1985). For many interesting sequential
decision tasks, there exists neither a database of
examples nor a reliable domain theory. In these cases,
one method for manually developing a set of decision
rules is to test a hypothetical set of rules against a
simulation model of the task environment, and to
incrementally modify the decision rules on the basis of
the simulated experience. This paper presents some
initial efforts toward using machine learning to automate
the process of learning sequential tasks with a simulation
model.

Sequential decision tasks may be characterized by the
following general scenario: A decision making agent
interacts  with  a  discrete-time  dynamical  system  in  an 
iterative  fashion.  At  the  beginning  of  each  time  step,  the 
agent  observes  a  representation  of  the  current  state  and 
selects  one  of  a  finite  set  of  actions,  based  on  the  agent’s 
decision  rules.  As  a  result,  the  dynamical  system  enters  a 
new  state  and  returns  a  (perhaps  null)  payoff.  This  cycle 
repeats  indefinitely.  The  objective  is  to  find  a  set  of 
decision rules that maximizes the expected total payoff.¹
Several  sequential  decision  tasks  have  been  investigated 
in  the  machine  learning  literature,  including  pole 
balancing  (Selfridge,  Sutton  &  Barto,  1985),  gas  pipeline 
control (Goldberg, 1983), and the animat problem
(Wilson,  1987).  For  many  interesting  problems, 
including  the  one  considered  here,  payoff  is  delayed  in 
the  sense  that  non-null  payoff  occurs  only  at  the  end  of 
an  episode  that  may  span  several  decision  steps.  In  fact, 
the  paradigm  is  quite  broad  since  it  includes  any  problem 
solving  task  by  defining  the  payoff  to  be  positive  for  any 
goal  state  and  null  for  non-goal  states  (Barto,  Sutton  & 
Watkins,  1989). 
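
The general scenario above can be sketched as a short interaction loop. This is only a minimal illustration under stated assumptions: the toy system `CountdownWorld`, the function `run_episode`, and the two decision rules are hypothetical and are not part of the system described in this paper. The toy world delivers a non-null payoff only on the final step, which mirrors the delayed-payoff case.

```python
import random


class CountdownWorld:
    """Toy discrete-time system (hypothetical): an episode lasts `horizon` steps.

    Payoff is delayed: it is null (0.0) on every step except the last, where it
    equals the fraction of steps on which the agent chose action 1.
    """

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.reset()

    def reset(self):
        self.t = 0
        self.ones = 0
        return self.t  # the observed state is just the step counter

    def step(self, action):
        self.t += 1
        self.ones += (action == 1)
        done = self.t >= self.horizon
        payoff = self.ones / self.horizon if done else 0.0
        return self.t, payoff, done


def run_episode(world, decide):
    """Run one episode: observe the state, select an action, collect the payoff."""
    state = world.reset()
    total, done = 0.0, False
    while not done:
        action = decide(state)
        state, payoff, done = world.step(action)
        total += payoff
    return total


def always_one(state):
    return 1


def random_rule(state):
    return random.choice([0, 1])


best = run_episode(CountdownWorld(), always_one)
```

Here a "set of decision rules" is reduced to a single function from state to action; searching for the function that maximizes the expected value returned by `run_episode` is the learning problem.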

The  experiments  described  here  reflect  two  important 
methodological  assumptions: 

1.  Our  learning  system  is  designed  to  continue  learning 
indefinitely. 

2. Since learning may require experimenting with
decision  rules  that  might  occasionally  produce 
unacceptable  results  if  applied  to  the  real  world,  we 
assume  that  hypothetical  rules  will  be  evaluated  in  a 
simulation  model. 


¹ If payoff is accumulated over an infinite period, the total payoff is
usually defined to be a (finite) time-weighted sum (Barto et al.,
1989).
In this methodology, a set of rules is periodically
extracted  from  the  learning  system  to  represent  the 
learning  system’s  current  plan.  This  plan  is  tested  in  the 
target  environment,  and  the  resulting  performance  is 
plotted  in  a  learning  curve.  Making  a  clear  distinction 
between  the  simulation  model  used  for  training  and  the 
target environment used for testing suggests a number of
experiments that measure the effects of differences
between the training model and the target environment.
Simulation  models  have  played  an  important  role  in 
earlier  machine  learning  efforts  (Goldberg,  1983; 
Booker,  1982,  1988;  Wilson,  1985;  Buchanan  et  al., 
1988); however, these works do not address the issue of
validation  of  the  simulation  model  with  respect  to  a  real 
target  system.  The  experiment  described  here  represents 
a  step  in  this  direction.  Specifically,  differences  between 
the  training  model  (the  simulator)  and  the  target 
environment with respect to noise are examined.


2 The Evasive Maneuvers Problem

The experiments described here concern a particular
sequential decision task called the Evasive Maneuvers
(EM) problem, inspired in part by Erickson and Zytkow
(1988). The tactical objective is to maneuver a plane to
avoid being hit by an approaching missile. The missile
tracks the motion of the plane and steers toward the
plane’s anticipated position. The initial speed of the
missile is greater than that of the plane, but the missile
loses speed as it maneuvers. If the missile speed drops
below some threshold, it loses maneuverability and drops
out of the sky. It is assumed that the plane is more
maneuverable than the missile. There are six sensors that
provide information about the current tactical state:

1. last-turn, the current turning rate of the plane;

2. time, a clock that indicates time since detection of the
missile;

3. range, the missile’s current distance from the plane;

4. bearing, the direction from the plane to the missile;

5. heading, the missile’s direction relative to the plane;
and

6. speed, the missile’s current speed measured relative
to the ground.

Although other sensors could be used, these sensors
represent a minimal, realistic set that might be available
from the pilot’s point of view.

Finally, there is a discrete set of actions available to
control the plane. In this study, we consider only actions
that specify discrete turning rates for the plane. The
learning objective is to develop a tactical plan, i.e., a set
of decision rules that map current sensor readings into
actions that successfully evade the missile whenever
possible.

² The current statement of the problem assumes a two-dimensional
world. Future experiments will adopt a three-dimensional model
and will address problems with multiple control variables, such as
controlling both the direction and the speed of the plane.

The EM problem is divided into episodes that begin
when the threatening missile is detected and that end
when either the plane is hit or the missile is exhausted.³
It is assumed that the only feedback provided is a
numeric payoff, supplied at the end of each episode, that
reflects the quality of the episode with respect to the goal
of evading the missile. Maximum payoff is given for
successfully evading the missile, and a smaller payoff,
based on how long the plane survived, is given for
unsuccessful episodes.

The EM problem is clearly a laboratory-scale model
of realistic tactical problems. Nevertheless, it includes
several features that make it a challenging machine
learning problem:

• weak domain knowledge (e.g., no predictive model
of the missile);

• incomplete state information provided by discrete
(possibly noisy) sensors;

• a large state space; and, of course,

• delayed payoff.

The  following  sections  present  one  approach  to 
addressing  these  challenges. 
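
The payoff scheme for EM episodes (maximum payoff for evading the missile, a smaller payoff based on survival time otherwise) might be sketched as follows. The paper does not give the payoff scale or the exact dependence on survival time, so the constants `MAX_PAYOFF` and `MAX_PARTIAL`, the linear form, and the parameter `max_steps` are all assumptions made for illustration.

```python
MAX_PAYOFF = 1000.0   # assumed value; the payoff scale is not stated in the text
MAX_PARTIAL = 500.0   # assumed cap for unsuccessful episodes (below MAX_PAYOFF)


def episode_payoff(evaded, steps_survived, max_steps):
    """Return the payoff supplied at the end of one episode.

    evaded          -- True if the missile was exhausted before hitting the plane
    steps_survived  -- number of decision steps before the episode ended
    max_steps       -- assumed upper bound on episode length, used for scaling
    """
    if evaded:
        return MAX_PAYOFF
    # Unsuccessful episode: a smaller payoff that grows with survival time.
    fraction = min(steps_survived, max_steps) / max_steps
    return MAX_PARTIAL * fraction
```

Whatever the exact form, the key property is that no payoff is available until the episode ends, so credit must be assigned to decisions made many steps earlier.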

3  SAMUEL  on  EM 

Samuel⁴ is a system designed to explore
competition-based  learning  for  sequential  decision  tasks 
(Grefenstette,  1989).  Samuel  consists  of  three  major 
components:  a  problem  specific  module,  a  performance 
module,  and  a  learning  module.  The  problem  specific 
module  consists  of  the  task  environment  simulation,  or 
world  model  (in  this  case,  the  EM  model),  and  its 
interfaces. The performance module is called CPS
(Competitive  Production  System),  a  production  system 
that  interacts  with  the  world  model  by  reading  sensors, 
setting  control  variables,  and  obtaining  payoff  from  a 
critic.  In  addition  to  matching,  CPS  implements  conflict 
resolution  as  a  competition  among  rules  based  on  rule 
strength  and  performs  credit  assignment  based  on  payoff 
(Grefenstette,  1988).  The  learning  module  uses  a  genetic 
algorithm to develop tactical plans, expressed as a set of
condition-action  rules.  Each  plan  is  evaluated  on  a 
number  of  tasks  in  the  world  model.  As  a  result  of  these 
evaluations,  genetic  operators,  such  as  crossover  and 
mutation,  produce  plausible  new  plans  from  high 
performance  parents.  More  detailed  descriptions  of 
Samuel  appear  in  (Grefenstette,  1989;  Grefenstette, 
Ramsey  &  Schultz,  1990). 
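
To give a feel for conflict resolution as a competition based on rule strength, the sketch below selects one of the matched rules with probability proportional to its strength. This is an assumed scheme chosen for illustration, not a description of the CPS procedure, whose details are given in the cited reports; the function name `select_rule` and the pair representation of rules are hypothetical.

```python
import random


def select_rule(matched, rng=random):
    """Pick one matched rule's action with probability proportional to strength.

    `matched` is a list of (action, strength) pairs with positive strengths.
    """
    actions = [action for action, _ in matched]
    strengths = [strength for _, strength in matched]
    return rng.choices(actions, weights=strengths, k=1)[0]


# Two rules that both match the current sensor readings but propose different turns.
matched_rules = [(("turn", 90), 750), (("turn", -45), 250)]
chosen = select_rule(matched_rules, random.Random(0))
```

A probabilistic competition of this kind lets weaker rules fire occasionally, so their strength estimates can still be revised by later payoff.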


³ For the experiments described here, the missile began each
episode at a fixed distance from the plane, traveling toward the
plane at a fixed speed. The direction from which the missile
approached was selected at random.

⁴ Samuel stands for Strategy Acquisition Method Using Empirical
Learning. The name also honors Art Samuel, one of the pioneers in
machine learning.
The design of Samuel owes much to Smith’s LS-1
system  (Smith,  1980),  and  draws  on  some  ideas  from 
classifier  systems  (Holland,  1986).  In  a  departure  from 
these  earlier  genetic  learning  systems,  Samuel  learns 
plans consisting of rules expressed in a high level rule
language. 

The  rule  language  allows  four  types  of  sensors: 

1.  linear,  where  the  condition  specifies  an  upper  and 
lower  bound  for  the  linear  ordered  values; 

2.  cyclic,  which  are  like  linear,  except  that  the  values 
wrap  around  from  the  upper  bound  to  the  lower 
bound; 

3.  structures,  where  the  sensor  specifies  a  list  of  values, 
and  the  condition  matches  if  the  sensor’s  current 
value  occurs  in  a  subtree  labeled  by  one  of  the  values 
in  the  list;  and 

4.  pattern,  where  the  sensor  specifies  a  pattern  over  the 
alphabet {0, 1, #}, as in classifier systems.

In  the  EM  domain,  only  linear  and  cyclic  type  sensors 
are used. An example of a rule for EM follows:

if (and (last-turn 0 45) (time 4 14) (range 500 1400)
(heading 330 90) (speed 50 850))
then (and (turn 90))
strength 750

Each  condition  (if  part)  specifies  a  range  over  the  named 
sensor,  and  each  action  (then  part)  specifies  the  value  for 
the named control variable. The strength is an estimate
of  the  rule’s  utility  and  is  used  for  conflict  resolution 
(Grefenstette,  1988). 
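
The matching of linear and cyclic conditions can be sketched as below, using the example rule as data. This is an illustrative sketch, not Samuel's matcher: the dictionary encoding, the function names, the inclusive bounds, and the choice of which sensors are treated as cyclic (`CYCLIC_SENSORS`) are assumptions.

```python
def in_linear(value, low, high):
    """Linear sensor: match when low <= value <= high (inclusive bounds assumed)."""
    return low <= value <= high


def in_cyclic(value, low, high):
    """Cyclic sensor: like linear, but the range may wrap around (e.g. 330 to 90)."""
    if low <= high:
        return low <= value <= high
    return value >= low or value <= high


# The example rule from the text, encoded as plain data.
RULE = {
    "conditions": {
        "last-turn": (0, 45),
        "time": (4, 14),
        "range": (500, 1400),
        "heading": (330, 90),
        "speed": (50, 850),
    },
    "action": ("turn", 90),
    "strength": 750,
}

CYCLIC_SENSORS = {"heading", "bearing"}  # assumption: the angular sensors are cyclic


def matches(rule, readings):
    """True if every condition in the rule holds for the current sensor readings."""
    for sensor, (low, high) in rule["conditions"].items():
        test = in_cyclic if sensor in CYCLIC_SENSORS else in_linear
        if not test(readings[sensor], low, high):
            return False
    return True


readings = {"last-turn": 0, "time": 10, "range": 900, "heading": 350, "speed": 400}
fires = matches(RULE, readings)
```

The condition (heading 330 90) shows why the cyclic type is needed: a heading of 350 or 45 satisfies it, while a heading of 180 does not.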

The  use  of  a  high  level  language  for  rules  offers 
several  advantages  over  low  level  binary  pattern 
languages  typically  adopted  in  genetic  learning  systems 
(Smith,  1980;  Goldberg,  1983).  First,  it  makes  it  easier 
to  incorporate  existing  knowledge,  whether  acquired 
from  experts  or  by  symbolic  learning  programs.  Second, 
it  is  easier  to  transfer  the  knowledge  learned  to  human 
operators.  Third,  it  makes  it  possible  to  combine 
empirical  methods  such  as  genetic  algorithms  with 
analytic  learning  methods  that  explain  the  success  of  the 
empirically  derived  rules  (Gordon  &  Grefenstette,  1990). 

4  Evaluation  of  the  Method 

This  section  presents  an  empirical  study  of  the 
performance  of  Samuel  on  the  EM  problem  with  respect 
to  the  differences  between  the  simulation  model  in  which 
the knowledge is learned and the target environment in
which  the  learned  knowledge  will  be  used. 

4.1  Experimental  Design 

The  learning  curves  shown  in  this  section  reflect  our 
assumptions about the methodology of simulation-assisted
learning. In particular, we make a distinction
between the world model used for learning and the target
environment. Let E denote the target environment for
which we want to learn decision rules. Let M denote a
simulation  model  of  E  that  can  be  used  for  learning.  The 
assumption  is  that  learning  continues  indefinitely  in  the 
background  using  system  M,  while  the  plan  being  used 
on  E  is  periodically  updated  with  the  current 
hypothetical  plan  of  the  learning  system.  The  genetic 
algorithm  in  SAMUEL  evaluates  the  fitness  of  each  plan 
in  its  population  by  measuring  its  performance  on  a 
number  of  episodes  (currently,  10  episodes  per  plan)  in 
the  training  model  M.  A  plan’s  fitness  determines  its 
reproductive  probability  for  the  next  generation.  At 
periodic  intervals  (currently,  10  generations),  a  single 
plan  is  extracted  from  the  current  population  to  represent 
the  learning  system’s  current  hypothetical  plan.  The 
extraction  is  accomplished  by  re-evaluating  the  top  20% 
of  the  current  population  on  100  randomly  chosen 
episodes on the simulation model M. The plan with the
best performance in this phase is designated the current
hypothesis of the learning system. This plan is tested in
the environment E for 100 randomly chosen problem
episodes.  The  plots  show  the  sequence  of  results  of 
testing  on  E,  using  the  current  plans  periodically 
extracted  from  the  learning  system.  Distinguishing  these 
two  systems  permits  the  study  of  how  well  the  learned 
plans  behave  if  E  varies  significantly  from  M,  as  is  likely 
in  practice. 
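
The extraction step described above (re-evaluate the top 20% of the population on 100 episodes in M and keep the best) can be sketched as follows. The function names and the representation of plans and episode runners are hypothetical; only the 20% and 100-episode settings come from the text.

```python
import random


def mean_payoff(plan, run_episode, n_episodes, rng):
    """Average payoff of `plan` over n_episodes randomly chosen episodes."""
    return sum(run_episode(plan, rng) for _ in range(n_episodes)) / n_episodes


def extract_plan(population, fitness, model_episode, rng,
                 top_fraction=0.20, n_episodes=100):
    """Re-evaluate the fittest plans on the training model M and return the best."""
    ranked = sorted(population, key=fitness, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    candidates = ranked[:n_top]
    return max(candidates,
               key=lambda plan: mean_payoff(plan, model_episode, n_episodes, rng))


# Toy stand-ins: a "plan" is just a number equal to its true quality, and an
# episode in M returns that quality exactly (no randomness) to keep this simple.
population = [i / 10 for i in range(10)]


def toy_fitness(plan):
    return plan


def toy_model_episode(plan, rng):
    return plan


current_hypothesis = extract_plan(population, toy_fitness, toy_model_episode,
                                  random.Random(0))
```

The extracted plan would then be scored with `mean_payoff` again, this time using an episode runner for the target environment E, to produce one point on the learning curve.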

Because  Samuel  employs  probabilistic  learning 
methods,  all  graphs  represent  the  mean  performance  over 
20  independent  runs  of  the  system,  each  run  using  a 
different seed for the random number generator. When
two  learning  curves  are  plotted  on  the  same  graph,  a 
vertical  line  between  the  curves  indicates  that  there  is  a 
statistically  significant  difference  between  the  means 
represented  by  the  respective  plots  (with  significance 
level α = 0.05) at that point on the curves. This device
allows  the  reader  to  see  significant  differences  between 
two  approaches  at  various  points  during  the  learning 
process. We feel that this is a better way to compare
learning  curves  for  continuously  learning  systems  than, 
say,  running  the  two  systems  for  a  fixed  amount  of  time 
and  comparing  the  performance  of  the  final  plans. 

4.2  Sensitivity  to  Sensor  Noise 

Noise  is  an  important  topic  in  machine  learning 
research,  since  real  environments  can  not  be  expected  to 
behave  as  nicely  as  laboratory  ones.  While  there  are 
many  aspects  of  a  real  environment  that  are  likely  to  be 
noisy,  we  can  identify  three  major  sources  of  noise  in  the 
kinds  of  sequential  decision  tasks  for  which  Samuel  is 
designed:  sensor  noise,  effector  noise,  and  payoff  noise. 
Sensor noise refers to errors in sensor data caused by
imperfect sensing devices. For example, if the radar
indicates that the range to an object is 1000 meters when
in fact the range is 875 meters, the performance system
has  received  noisy  data.  Effector  noise  refers  to  errors 
arising  when  effectors  fail  to  perform  the  action  indicated 
by  the  current  control  settings.  For  example,  in  the  EM 
world,  an  effector  command  might  be  (turn  45),  meaning 
that  the  effectors  should  initiate  a  45  degree  left  turn.  If 
the  plane  turns  at  a  different  rate,  the  effector  has 
introduced some noise. An additional source of noise
can  arise  during  the  learning  phase,  if  the  critic  gives 
noisy  feedback.  For  example,  a  noisy  critic  might  issue  a 
high  payoff  value  for  an  episode  in  which  the  plane  is  hit. 

While  all  of  these  types  of  noise  are  interesting,  we 
restrict  our  attention  to  the  noise  caused  by  sensors.  To 
test  the  effects  of  sensor  noise  on  Samuel,  two 
environments  were  defined.  In  one  environment,  the 
sensors  are  noise-free.  In  the  second  environment,  noise 
is  added  to  each  of  the  four  external  sensors  that  indicate 
the  missile’s  range,  bearing,  heading,  and  speed.  Noise 
consists  of  a  random  draw  from  a  normal  distribution 
with  mean  0.0  and  standard  deviation  equal  to  10%  of 
the  legal  range  for  the  corresponding  sensor.  The 
resulting  value  is  then  discretized  according  to  the 
defined  granularity  of  the  sensor.  For  example,  suppose 
the missile’s true heading is 66 degrees. The noise
consists  of  a  random  draw  from  a  normal  distribution 
with  standard  deviation  36  (10%  of  360  degrees), 
resulting  in  a  value  of,  say,  22.  The  noisy  result,  88,  is 
then discretized to the nearest 10 degree boundary (as
specified  by  the  granularity  of  the  heading  sensor),  and 
the  final  sensor  reading  is  90.  As  this  example  shows,  the 
amount  of  noise  in  this  environment  is  rather  substantial. 
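
The sensor noise model can be written down compactly. The sketch below follows the description above (zero-mean normal noise with standard deviation equal to 10% of the legal range, then discretization to the sensor's granularity). The function names are hypothetical, and the wrap-around for cyclic sensors and the handling of exact ties are assumptions not stated in the text.

```python
import random


def discretize(value, granularity, cyclic_mod=None):
    """Snap to the nearest multiple of `granularity`.

    Exact ties follow Python's round-half-to-even (an arbitrary choice here).
    If cyclic_mod is given, the result wraps around (assumed for angles).
    """
    snapped = granularity * round(value / granularity)
    if cyclic_mod is not None:
        snapped %= cyclic_mod
    return snapped


def noisy_reading(true_value, legal_range, granularity, rng, cyclic_mod=None):
    """Add Gaussian noise (mean 0, sd = 10% of the legal range), then discretize."""
    noise = rng.gauss(0.0, 0.10 * legal_range)
    return discretize(true_value + noise, granularity, cyclic_mod)


# The worked example from the text: true heading 66, noise draw of 22.
example = discretize(66 + 22, granularity=10, cyclic_mod=360)

# A random reading for the heading sensor (legal range 360, granularity 10).
sample = noisy_reading(66, 360, 10, random.Random(1), cyclic_mod=360)
```

With a standard deviation of 36 degrees, readings that are off by several granularity steps are common, which is why the noise level is described as substantial.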

Given  the  noisy  environment  and  the  noise-free 
environment,  there  are  four  possible  experimental 
conditions,  depending  on  which  environment  is  used  for 
the  simulation  model  (M)  and  which  is  used  for  the  target 
environment  (E).  For  each  experimental  condition,  the 
genetic  algorithm  was  executed  for  200  generations,  and 
the  results  are  shown  in  Figures  1  and  2. 
In Figure 1, training was performed in the model with
noise-free  sensors,  and  the  resulting  plans  were  tested  in 
the  environment  with  noise-free  sensors  (dashed  curve) 
and  noisy  sensors  (solid  curve).  In  Figure  2,  training  was 
performed in the model with noisy sensors, and again
tested  with  both  noise-free  sensors  (dashed  curve)  and 
noisy  sensors  (solid  curve). 


Figure 2: Learning with Noisy Sensors (horizontal axis: Generations)

Figure  1  shows  that  plans  learned  with  noise-free 
sensors  perform  significantly  worse  throughout  the 
learning  period  when  the  testing  environment  contains 
noise (solid curve) than when the testing environment is
noise-free  (dashed  curve).  For  example,  after  200 
generations,  the  current  tactical  plan  evades  the  missile 
about  98%  of  the  time  when  the  sensors  are  noise-free, 
but  only  about  70%  of  the  time  when  the  sensors  are 
noisy. On the other hand, Figure 2 shows that the plans
learned  under  noisy  conditions  perform  fairly  well  in 
both  target  environments.  After  200  generations,  the 
current  tactical  plan  evades  the  missile  about  88%  of  the 
time when the sensors are noisy, and about 92% when
the sensors are noise-free. Clearly, more robust rules are
being learned, at a cost of slower improvement.

By  comparing  the  two  dashed  curves  (or  the  two  solid 
curves)  in  Figures  1  and  2,  it  may  be  concluded  that,  for 
a fixed target environment, Samuel learns best when the
training  environment  matches  the  target  environment. 
However, this ideal case will not generally be realized in
practice,  especially  if  the  target  is  a  real-world  system 
and  the  training  model  is  a  simulation.  Our  results  show 
that  using  a  training  environment  that  is  less  regular  (in 
this  case,  more  noisy)  than  the  target  environment  is 
better  than  having  a  training  model  with  spurious 
regularities  (e.g.,  noise-free  sensors)  that  do  not  occur  in 
the  target  environment. 

5  Summary  and  Further  Research 

One  important  lesson  of  the  empirical  study  is  that 
Samuel  is  an  opportunistic  learner,  and  will  tailor  the 
plans that it learns to the regularities it finds in the
training model. It follows that the closer the training
model matches the conditions expected in the target
environment, in terms of sensor noise, the better the
learned plans will be. In the absence of a perfect match
between training model and target environment, it is
better  to  have  too  little  regularity  in  the  training  model 
than  too  much. 
This study illustrates just one of many investigations
one  might  pursue  in  assessing  the  effects  of  differences 
between  the  model  and  the  target  environment.  In 
another  study,  we  have  examined  the  effect  of  differences 
in  the  initial  conditions  (i.e.,  the  missile’s  initial  range, 
speed,  and  heading)  between  the  training  model  and  the 
target  environment.  Preliminary  results  support  the 
general conclusion reported here: that it is far less risky
to have a training model with overly general initial
conditions than to have one with overly restricted initial
conditions (Grefenstette et al., 1990).

Current  efforts  are  also  aimed  at  augmenting  the  task 
environment  to  test  Samuel’s  ability  to  learn  tactical 
plans  for  more  realistic  scenarios.  Multiple  incoming 
threats  will  be  considered,  as  well  as  multiple  control 
variables  (e.g.,  accelerations,  directions,  weapons,  etc.). 

As  simulation  technology  improves,  it  will  become 
possible to provide learning systems with high fidelity
simulations of tasks whose complexity or uncertainty
precludes the use of traditional knowledge engineering
methods.  No  matter  what  the  degree  of  sophistication  of 
the  simulator,  it  will  be  important  to  assess  the  effects  on 
any  learning  method  of  the  differences  between  the 
simulation model and the target environment. These
initial  studies  with  a  simple  tactical  problem  have  shown 
that  it  is  possible  for  learning  systems  based  on  genetic 
algorithms  to  effectively  search  a  space  of  knowledge 
structures  and  discover  sets  of  rules  that  provide  high 
performance  in  a  variety  of  target  environments.  Further 
developments  along  these  lines  can  be  expected  to 
reduce the manual knowledge acquisition effort required
to  build  systems  with  expert  performance  on  complex 
sequential  decision  tasks. 
Abstract

This paper extends previous work with Dyna,
a class of architectures for intelligent systems
based on approximating dynamic programming
methods. Dyna architectures integrate
trial-and-error (reinforcement) learning and
execution-time planning into a single process
operating alternately on the world and on a
learned model of the world. In this paper, I
present and show results for two Dyna architectures.
The Dyna-PI architecture is based
on dynamic programming’s policy iteration
method and can be related to existing AI
ideas such as evaluation functions and universal
plans (reactive systems). Using a navigation
task, results are shown for a simple
Dyna-PI system that simultaneously learns
by trial and error, learns a world model, and
plans optimal routes using the evolving world
model. The Dyna-Q architecture is based
on Watkins’s Q-learning, a new kind of reinforcement
learning. Dyna-Q uses a less familiar
set of data structures than does Dyna-PI,
but is arguably simpler to implement and use.
We show that Dyna-Q architectures are easy
to adapt for use in changing environments.


1  Introduction  to  Dyna 

How  should  a  robot  decide  what  to  do?  The  traditional 
answer  in  AI  has  been  that  it  should  deduce  its  best 
action  in  light  of  its  current  goals  and  world  model, 
i.e.,  that  it  should  plan.  However,  it  is  now  widely 
recognized  that  planning’s  usefulness  is  limited  by  its 
computational  complexity  and  by  its  dependence  on 
an  accurate  world  model.  An  alternative  approach  is 
to do the planning in advance and compile its result
into a set of rapid reactions, or situation-action rules,
which  are  then  used  for  real-time  decision  making.  Yet 
a  third  approach  is  to  learn  a  good  set  of  reactions  by 
trial  and  error;  this  has  the  advantage  of  eliminating 


the  dependence  on  a  world  model.  In  this  paper  I 
briefly  introduce  Dyna,  a  class  of  simple  architectures 
integrating  and  permitting  tradeoffs  among  these  three 
approaches. 

Dyna architectures use machine learning algorithms to approximate the
conventional optimal control technique known as dynamic programming
(DP) (Bellman, 1957; Ross, 1983). DP itself is not a learning method,
but rather a computational method for determining optimal behavior
given a complete model of the task to be solved. It is very similar to
state-space search, but differs in that it is more incremental and
never considers actual action sequences explicitly, only single actions
at a time. This makes DP more amenable to incremental planning at
execution time, and also makes it more suitable for stochastic or
incompletely modeled environments, as it need not consider the
extremely large number of sequences possible in an uncertain
environment. Learned world models are likely to be stochastic and
uncertain, making DP approaches particularly promising for learning
systems. Dyna architectures are those that learn a world model online
while using approximations to DP to learn and plan optimal behavior.

Intuitively, Dyna is based on the old idea that planning is like
trial-and-error learning from hypothetical experience (Craik, 1943;
Dennett, 1978). The theory of Dyna is based on the theory of DP (e.g.,
Ross, 1983) and on DP's relationship to reinforcement learning
(Watkins, 1989; Barto, Sutton & Watkins, 1989, 1990), to
temporal-difference learning (Sutton, 1988), and to AI methods for
planning and search (Korf, 1990). Werbos (1987) has previously argued
for the general idea of building AI systems that approximate dynamic
programming, and Whitehead (1989) and others (Sutton & Barto, 1981;
Sutton & Pinette, 1985; Rumelhart et al., 1986) have presented results
for the specific idea of augmenting a reinforcement learning system
with a world model used for planning.


2  Dyna-PI: Dyna by Approximating Policy Iteration

I call the first Dyna architecture Dyna-PI because it is based on
approximating a DP method known as policy iteration (Howard, 1960).
The Dyna-PI architecture consists of four components interacting as
shown in Figure 1. The policy is simply the function formed by the
current set of reactions; it receives as input a description of the
current state of the world and produces as output an action to be sent
to the world. The world represents the task to be solved;
prototypically it is the robot's external environment. The world
receives actions from the policy and produces a next-state output and a
reward output. The overall task is defined as maximizing the long-term
average reward per time step (cf. Russell, 1989). The architecture also
includes an explicit world model. The world model is intended to mimic
the one-step input-output behavior of the real world. Finally, the
Dyna-PI architecture includes an evaluation function that rapidly maps
states to values, much as the policy rapidly maps states to actions.
The evaluation function, the policy, and the world model are each
updated by separate learning processes.


Figure 1. Overview of the Dyna Architecture. With
the  world  in  place  as  shown  we  have  reinforcement 
learning;  with  the  world  model  switched  in  place  of 
the  world  we  have  planning. 


For a fixed policy, Dyna-PI is simply a reactive system. However, the
policy is continually adjusted by an integrated planning/learning
process. The policy is, in a sense, a plan, but one that is completely
conditioned by current input. The planning process is incremental and
can be interrupted and resumed at any time. It consists of a series of
shallow searches, each typically of one ply, and yet ultimately
produces the same result as an arbitrarily deep conventional search. I
call this relaxation planning. Dynamic programming is a special case of
this.

Relaxation planning is based on continually adjusting the evaluation
function in such a way that credit is propagated to the appropriate
steps within action sequences. Generally speaking, the evaluation
$e(x)$ of a state $x$ should be equal to the best of the states $y$
that can be reached from it in one action, taking into consideration
the reward (or cost) $r$ for that one transition:

$$e(x) \;\text{``=''}\; \max_{a \in Actions} E\{\, r + e(y) \mid x, a \,\}, \tag{1}$$

where $E\{\cdot \mid \cdot\}$ denotes a conditional expected value and
the equal sign is quoted to indicate that this is a condition that we
would like to hold, not one that necessarily does hold. If we have a
complete model of the world, then the right-hand side can be computed
by looking ahead one action. Thus we can generate any number of
training examples for the process that learns the evaluation function:
for any $x$, the right-hand side of (1) is the desired output. If the
learning process converges such that (1) holds in all states, then the
optimal policy is given by choosing the action in each state $x$ that
achieves the maximum on the right-hand side. There is an extensive
theoretical basis from dynamic programming for algorithms of this type
for the special case in which the evaluation function is tabular, with
enumerable states and actions. For example, this theory guarantees
convergence to a unique evaluation function satisfying (1) and that the
corresponding policy is optimal (Ross, 1983).
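
As a rough illustration of the backup in (1), the following sketch applies it repeatedly to a tabular evaluation function. The four-state corridor, its unit step cost, and all names (`model`, `relaxation_sweep`, `greedy_action`) are assumptions made for this example only; they are not taken from the experiments reported in this paper.

```python
# Hypothetical 4-state corridor; state 3 is terminal and every move costs 1.
ACTIONS = ("left", "right")
TERMINAL = 3


def model(x, a):
    """Complete deterministic one-step model: returns (next_state, reward)."""
    y = max(x - 1, 0) if a == "left" else min(x + 1, TERMINAL)
    return y, -1.0


def relaxation_sweep(e):
    """Apply the one-step backup of (1) once to every non-terminal state."""
    for x in range(TERMINAL):
        outcomes = [model(x, a) for a in ACTIONS]
        e[x] = max(r + e[y] for y, r in outcomes)
    return e


def greedy_action(e, x):
    """Action achieving the maximum on the right-hand side of (1)."""
    def backed_up_value(a):
        y, r = model(x, a)
        return r + e[y]
    return max(ACTIONS, key=backed_up_value)


e = [0.0] * (TERMINAL + 1)      # the terminal state keeps value 0
for _ in range(10):             # a series of shallow, one-ply look-aheads
    relaxation_sweep(e)
```

Each sweep looks ahead only one action, yet after a few sweeps `e[x]` equals the negated distance to the terminal state, which is what a deep search from each state would have found.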

The evaluation function and policy need not be tables, but can be more
compact function approximators such as decision trees, k-d trees,
connectionist networks, or symbolic rules. Although the existing theory
does not apply to these machine learning algorithms directly, it does
provide a theoretical foundation for exploring their use in this way.
This kind of planning also extends conventional state-space planning in
that it is applicable to stochastic and uncertain worlds and to
non-boolean goals.

The above discussion gives the general idea of relaxation planning, but
not the exact form used in policy iteration and Dyna-PI, in which the
policy is adapted simultaneously with the evaluation function. The
evaluations in this case are not supposed to reflect the value of
states given optimal behavior, but rather their value given current
behavior (the current policy). As the current policy gradually
approaches optimality, the evaluation function also approaches the
optimal evaluation function. In addition, Dyna-PI is a Monte Carlo or
stochastic approximation variant of policy iteration, in which the
world model is only sampled, not examined directly. Since the real
world can also be sampled, by actually taking actions and observing the
result, the world can be used in place of the world model in these
methods. In this case, the result is not relaxation planning, but a
trial-and-error learning process much like reinforcement learning (see
Barto, Sutton & Watkins, 1989, 1990). In Dyna-PI, both of these are
done at once. The same algorithm is applied both to real experience
(resulting in learning) and to hypothetical experience generated by the
world model (resulting in relaxation planning). The results in both
cases are accumulated in the policy and the evaluation function.

There is insufficient room here to fully justify the algorithm used in
Dyna-PI, but it is quite simple and is given in outline form in
Figure 2. The algorithm is based on a version of (1) modified to
discount later reward relative to immediate reward:

$$e(x) \;\text{``=''}\; \max_{a \in Actions} E\{\, r + \gamma e(y) \mid x, a \,\}, \tag{2}$$

where $\gamma$, $0 < \gamma < 1$, is the discount rate. Whereas (1) is
limited to tasks that end with a clear termination event, such as the
finding of a goal state or the end of a board game, (2) can be used for
tasks that continue indefinitely, with rewards and/or penalties
arriving on each step. Algorithms based on (2) are meant to estimate
and maximize the expected value of a discounted sum of future reward:

$$E\{\, r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \,\},$$

where $r_1, r_2, r_3, \ldots$ is the sequence of future rewards. This
is a standard optimization criterion in dynamic programming and Markov
decision processes.
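
To make the discounted criterion concrete, here is a small sketch that evaluates the discounted sum for a finite reward sequence. The function name `discounted_return` and the sample sequence are illustrative assumptions; the paper itself concerns the expected value over indefinitely long sequences.

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite sequence."""
    total = 0.0
    # Work backwards so each step applies one factor of gamma to what follows.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A reward of +1 arriving on the third step is worth gamma^2 now.
value_now = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
```

The backward recursion mirrors the form of (2): the value now is the immediate reward plus the discounted value of what follows.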


3  A  Navigation  Task 

As an illustration of the Dyna-PI architecture, consider the task of
navigating the maze shown in the upper right of Figure 3. The maze is a
6 by 9 grid of possible locations or states, one of which is marked as
the starting state, “S”, and one of which is marked


1. Decide if this will be a real experience or a hypothetical one.

2. Pick a state $x$. If this is a real experience, use the current
state.

3. Choose an action: $a \leftarrow \mathrm{Policy}(x)$

4. Do action $a$; obtain next state $y$ and reward $r$ from world or
world model.

5. If this is a real experience, update world model from $x$, $a$, $y$
and $r$.

6. Update evaluation function so that $e(x)$ is more like
$r + \gamma e(y)$; this is temporal-difference learning.

7. Update policy: strengthen or weaken the tendency to perform action
$a$ in state $x$ according to the error in the evaluation function:
$r + \gamma e(y) - e(x)$.

8. Go to Step 1.

Figure 2. Inner Loop of the Dyna-PI Algorithm. These steps are repeated
continually, sometimes with real experiences, sometimes with
hypothetical ones.


as the goal state, “G”. The shaded states act as barriers and cannot be
entered. All the other states are distinct and completely
distinguishable. From each there are four possible actions: UP, DOWN,
RIGHT, and LEFT, which change the state accordingly, except where such
a movement would take the system into a barrier or outside the maze, in
which case the location is not changed. Reward is zero for all
transitions except for those into the goal state, for which it is +1.
Upon entering the goal state, the system is instantly transported back
to the start state to begin the next trial.* None of this structure and
dynamics is known to the Dyna-PI system a priori.
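
The dynamics just described can be sketched as a one-step transition function. The grid size and the reward rule follow the text; the positions of the start, goal, and barriers are assumptions chosen for illustration, since the actual layout appears only in the figure.

```python
ROWS, COLS = 6, 9
START, GOAL = (2, 0), (0, 8)        # assumed positions, not from the text
BARRIERS = {(1, 2), (2, 2), (3, 2), (4, 5), (0, 7), (1, 7), (2, 7)}  # assumed
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "RIGHT": (0, 1), "LEFT": (0, -1)}


def step(state, action):
    """Return (next_state, reward) for one move in the maze."""
    d_row, d_col = MOVES[action]
    row, col = state[0] + d_row, state[1] + d_col
    outside = not (0 <= row < ROWS and 0 <= col < COLS)
    if outside or (row, col) in BARRIERS:
        return state, 0.0           # blocked: the location is not changed
    if (row, col) == GOAL:
        return START, 1.0           # reward +1, then back to the start state
    return (row, col), 0.0
```

As in the footnote, the goal state is never actually occupied here: the move that would enter it yields the reward and returns the start state directly.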

In this demonstration, the world was assumed to be deterministic, that
is, to be a finite-state automaton, and the world model was implemented
simply as next-state and reward tables that were filled in whenever a
new state-action pair was experienced (Step 5 of Figure 2). The
evaluation function was also implemented as a table and was updated
(Step 6) according to the simplest temporal-difference learning method:
$e(x) \leftarrow e(x) + \beta\,(r + \gamma e(y) - e(x))$, where $\beta$
is a positive learning-rate parameter. The policy was implemented as a
table with an entry $w_{xa}$ for every pair of state $x$ and action
$a$. Actions were selected (Step 3) stochastically according to a
Boltzmann distribution:
$P(a \mid x) = e^{w_{xa}} / \sum_{b} e^{w_{xb}}$. The policy was
updated (Step 7) according to:
$w_{xa} \leftarrow w_{xa} + \alpha\,(r + \gamma e(y) - e(x))$. For

*In fact, the goal state is never entered; the UP action from the state
below produces a reward of +1 and sends the system directly to the
start state.


Integrated  Architecture  for  Learning,  Planning,  and  Reacting  219 


hypothetical experiences, states were selected (Step 2) at random
uniformly over all states previously encountered. The initial values of
the evaluation function $e(x)$ and the policy table entries $w_{xa}$
were all zero; the initial policy was thus a random walk. The world
model was initially empty; if a state and action were selected for a
hypothetical experience that had never been experienced in reality,
then the following steps (Steps 4-7) were simply omitted.
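
The tabular Dyna-PI loop described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the code used for the reported experiments: the four-state corridor `world`, the function names, and the use of one real experience followed by `K` hypothetical ones are illustrative, and the Boltzmann probabilities subtract the largest weight before exponentiating purely for numerical stability.

```python
import math
import random

N_STATES, START = 4, 0              # hypothetical corridor: states 0..3
ACTIONS = (0, 1)                    # 0 = left, 1 = right
BETA, GAMMA, ALPHA, K = 0.1, 0.9, 10.0, 10


def world(x, a):
    """Real world: moving right from the last state pays +1 and restarts."""
    if a == 1 and x == N_STATES - 1:
        return START, 1.0
    return (max(x - 1, 0) if a == 0 else x + 1), 0.0


def action_probs(w, x):
    """Boltzmann distribution over the policy weights of state x."""
    top = max(w[x])                 # subtracting the max avoids overflow
    z = [math.exp(v - top) for v in w[x]]
    total = sum(z)
    return [v / total for v in z]


def dyna_pi(real_steps, rng):
    e = [0.0] * N_STATES                         # evaluation function
    w = [[0.0, 0.0] for _ in range(N_STATES)]    # policy weights w[x][a]
    model, seen = {}, []                         # learned world model
    x_real, total_reward = START, 0.0
    for _ in range(real_steps):
        for i in range(K + 1):                   # 1 real + K hypothetical
            real = (i == 0)                                        # Step 1
            x = x_real if real else rng.choice(seen)               # Step 2
            a = rng.choices(ACTIONS, weights=action_probs(w, x))[0]  # Step 3
            if real:
                y, r = world(x, a)                                 # Step 4
                model[(x, a)] = (y, r)                             # Step 5
                if x not in seen:
                    seen.append(x)
                x_real, total_reward = y, total_reward + r
            elif (x, a) in model:
                y, r = model[(x, a)]                               # Step 4
            else:
                continue            # never tried in reality: skip Steps 4-7
            delta = r + GAMMA * e[y] - e[x]      # temporal-difference error
            e[x] += BETA * delta                                   # Step 6
            w[x][a] += ALPHA * delta                               # Step 7
    return e, w, model, total_reward


e, w, model, total_reward = dyna_pi(500, random.Random(0))
```

Setting `K = 0` in this sketch removes the hypothetical experiences and leaves a pure trial-and-error learner, which is the comparison made in the learning curves.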

In this instance of the Dyna-PI architecture, real and hypothetical
experiences were used alternately (Step 1). For each experience with
the real world, $k$ hypothetical experiences were generated with the
model. Figure 3 shows learning curves for $k = 0$, $k = 10$, and
$k = 100$, each an average over 100 runs. The $k = 0$ case involves no
planning; this is a pure trial-and-error learning system entirely
analogous to those used in some reinforcement learning systems (Barto,
Sutton & Anderson, 1983; Sutton, 1984; Anderson, 1987). Although the
length of path taken from start to goal falls dramatically for this
case, it falls much more rapidly for the cases including hypothetical
experiences, showing the benefit of relaxation planning using the
learned world model. For $k = 100$, the optimal path was generally
found and followed by the fourth trip from start to goal; this is very
rapid learning. The parameter values used were $\beta = 0.1$,
$\gamma = 0.9$, and $\alpha = 1000$ ($k = 0$) or $\alpha = 10$
($k = 10$ and $k = 100$). The $\alpha$ values were chosen roughly to
give the best performance for each $k$ value.

Figure 3. Learning Curves for Dyna-PI Systems on a Simple Navigation
Task (horizontal axis: trials). A trial is one trip from the start
state “S” to the goal state “G”. The more hypothetical experiences
(“planning steps”) using the world model, the faster an optimal path
was found.

Figure 4 shows why a Dyna-PI system that includes planning solves this
problem so much faster than one that does not. Shown are the policies
found by the $k = 0$ and $k = 100$ Dyna-PI systems half-way through the
second trial. Without planning ($k = 0$), each trial adds only one
additional step to the policy, and so only one step (the last) has been
learned so far. With planning, the first trial also learned only one
step, but here during the second trial an extensive policy has been
developed that by the trial's end will reach almost back to the start
state. By the end of the third or fourth trial a complete optimal
policy will have been found and perfect performance attained.


Figure 4. Policies Found by Planning and Non-Planning Dyna-PI Systems
by the Middle of the Second Trial (panels: WITHOUT PLANNING, k = 0, and
WITH PLANNING, k = 100). The black square indicates the current
location of the Dyna-PI system. The arrows indicate action
probabilities (excess over the smallest) for each direction of
movement.

4  Problems of Changing Worlds

Suppose that, after a Dyna-PI system has learned the optimal path from
start to goal, a new barrier is added that blocks the optimal path. The
Dyna-PI system described above will run into the block and then try the
formerly effective action many hundreds of times. Eventually, the
correct new path may be found, but the process is very slow. It seems
inappropriately slow in that the system's world model is updated
immediately. Even though the world model knows that the formerly good
action is now poor, this is not reflected in the system's behavior for
a long time. I call this the blocking problem.

Part of the problem is that the alternative actions are never tried,
even hypothetically, because the policy assigns them a probability of
zero. The model knows these actions are better, but this has no effect
unless they are tried. One idea for solving this problem is to allow
hypothetical actions to be selected according to a more liberal policy
than that used to select real actions. The simplest case of this is
that in which hypothetical actions are selected at random uniformly. If
this is done, a small adjustment must be made to the evaluation update
(Step 6). Recall that the evaluation function is supposed to represent
the value of each state given the current policy. If hypothetical
actions are selected uniformly, then the bias toward the current policy
must be introduced explicitly. To do this, the evaluation update
(Step 6), on hypothetical steps only, is altered to be weighted by the
current action probability:
$e(x) \leftarrow e(x) + \beta\,(r + \gamma e(y) - e(x))\,P(a \mid x)$.
In empirical studies we have indeed found this to be an improvement on
the original algorithm, substantially improving the robustness of its
convergence onto optimal behavior. However, this does not solve the
blocking problem: the system still takes many hundreds of actions into
an added barrier before finally finding a way around it.

Now consider a second sort of change in the environment. Suppose, after
the optimal path has been learned, a barrier is removed that permits a
shorter path from start to goal. The simple Dyna-PI system introduced
above is unable to take advantage of such a shortcut; it never wavers
from the formerly optimal path and thus never discovers that the former
obstacle is gone. This is the shortcut problem. In addition to trying
to improve the Dyna-PI system to handle blocks, we might also seek to
improve it to handle shortcuts. What is needed here is some way of
continually testing the world model. In the next section we introduce a
slightly different architecture that handles both kinds of changes with
little increase in complexity.


5  Dyna-Q: Dyna by Q-learning

The Dyna-PI architecture is in essence the reinforcement learning
architecture that my colleagues and I developed (Sutton, 1984; Barto,
Sutton & Anderson, 1983) plus the idea of using a learned world model
to generate hypothetical experience and to plan. Watkins (1989)
subsequently developed the relationships between the
reinforcement-learning architecture and dynamic programming (see also
Barto, Sutton & Watkins, 1989, 1990) and, moreover, proposed a slightly
different kind of reinforcement learning called Q-learning. The Dyna-Q
architecture is the combination of this new kind of learning with the
Dyna idea of using a learned world model to generate hypothetical
experience and achieve planning.

Whereas the original reinforcement learning architecture maintains two
fundamental memory structures, the evaluation function and the policy,
Q-learning maintains only one. That one is a cross between an
evaluation function and a policy. For each pair of state $x$ and action
$a$, Q-learning maintains an estimate $Q_{xa}$ of the value of taking
$a$ in $x$. The value of a state can then be defined as the value of
the state's best state-action pair:

$$e(x) = \max_{a} Q_{xa}.$$

In general, the Q-value for a state $x$ and an action $a$ should equal
the expected value of the immediate reward $r$ plus the discounted
value of the next state $y$:

$$Q_{xa} \;\text{``=''}\; E\{\, r + \gamma e(y) \mid x, a \,\}. \tag{3}$$

To achieve this goal, the updating steps (Steps 6 and 7 of Figure 2)
are implemented by

$$Q_{xa} \leftarrow Q_{xa} + \beta\,(r + \gamma e(y) - Q_{xa}). \tag{4}$$

This is the only update rule in Q-learning. We note that it is very
similar though not identical to Holland's (1986) bucket brigade and to
Sutton's (1988) temporal-difference learning.
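
A minimal tabular sketch of update (4) follows. The dictionary representation keyed by `(state, action)`, the function name `q_update`, and the example states are assumptions for illustration; the default `beta` and `gamma` are the values reported later for the Dyna-Q experiments.

```python
from collections import defaultdict


def q_update(Q, x, a, r, y, actions, beta=0.5, gamma=0.9):
    """Apply update (4) to the Q-value of taking action a in state x."""
    e_y = max(Q[(y, b)] for b in actions)       # e(y) = max_b Q_yb
    Q[(x, a)] += beta * (r + gamma * e_y - Q[(x, a)])
    return Q[(x, a)]


actions = ["UP", "DOWN"]
Q = defaultdict(float)                          # all Q-values start at zero
first = q_update(Q, "s", "UP", 1.0, "t", actions)
Q[("t", "DOWN")] = 2.0                          # pretend "t" is now valued
second = q_update(Q, "s", "UP", 0.0, "t", actions)
```

Note that the update uses the best Q-value of the next state regardless of which action is actually taken there.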

So far, the Dyna-Q architecture is slightly simpler than the Dyna-PI
architecture. Two data structures have been replaced with one (which is
no larger than one of the original two), and one update rule and one
parameter ($\alpha$) have been eliminated. However, there remains the
matter of determining the policy from the Q-values, as we discuss
below. One advantage of Q-learning is that it requires no special
adjustments if the action selection during hypothetical experience is
different from the current policy. Watkins (1989) has shown that the
Q-values will converge properly whatever policy is used, either
hypothetically or in reality, as long as all state-action pairs are
repetitively tried. In the following experiments, actions were selected
at random uniformly (Step 3) on hypothetical experiences.

The simplest way of determining the policy on real experiences is to
deterministically select the action that currently looks best, that is,
the action with the maximal Q-value. However, as we show below, this
approach alone suffers from inadequate exploration and cannot solve the
shortcut problem. In his work with Q-learning, Watkins implemented the
policy probabilistically using a Boltzmann distribution:
$P(a \mid x) = e^{\alpha Q_{xa}} / \sum_{b} e^{\alpha Q_{xb}}$. An
annealing process was added in which $\alpha$ tended to infinity so
that even a small difference between Q-values would eventually lead to
the best action being selected with probability one. That approach,
however, recreates the problem of loss of variability in behavior such
that shortcuts cannot be found.

To deal directly with the shortcut problem, a new memory structure was
added that keeps track of the degree of uncertainty about each
component of the model. For each state $x$ and action $a$, a record is
kept of the number of time steps $n_{xa}$ that have elapsed since $a$
was tried in $x$ in a real experience. The square root $\sqrt{n_{xa}}$
is used as a measure of the uncertainty about $Q_{xa}$.² To encourage
exploration, each state-action pair is given an exploration bonus
proportional to this uncertainty measure. For real experiences, the
policy is to select the action $a$ that maximizes
$Q_{xa} + \epsilon\sqrt{n_{xa}}$, where $\epsilon$ is a small positive
parameter. This method of encouraging variety is very similar to that
used in Kaelbling's (in preparation) interval-estimation algorithm.
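
The selection rule with an exploration bonus can be sketched as below. The dictionary tables, the helper `tick` that maintains the elapsed-time counts, and the example values are assumptions for illustration only.

```python
import math


def select_action(Q, n, x, actions, epsilon=0.001):
    """Pick the action a maximizing Q[x, a] + epsilon * sqrt(n[x, a])."""
    return max(actions,
               key=lambda a: Q[(x, a)] + epsilon * math.sqrt(n[(x, a)]))


def tick(n, x, a):
    """After really taking a in x: every pair ages by one step, (x, a) resets."""
    for pair in n:
        n[pair] += 1
    n[(x, a)] = 0


actions = ["L", "R"]
Q = {("s", "L"): 0.50, ("s", "R"): 0.49}
n = {("s", "L"): 0, ("s", "R"): 0}
choice_fresh = select_action(Q, n, "s", actions)    # bonuses are both zero
n[("s", "R")] = 400                                 # "R" untried for 400 steps
choice_stale = select_action(Q, n, "s", actions)    # bonus 0.001 * 20 = 0.02
tick(n, "s", "L")
```

With a small `epsilon`, the bonus only overturns the greedy choice when the Q-values are close or an action has gone untried for a long time.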

However, this approach alone does not take advantage of the planning
capability of Dyna architectures. Suppose there is a state-action pair
that has not been tested in a long time, but which is far from the
currently preferred path, and thus extremely unlikely to be tried even
with the exploration bonus discussed above. In a Dyna system, why not
expect the system to plan an action sequence to go out and test the
uncertain state-action pair? If there is genuine uncertainty, then
there is potential benefit in going out and trying the action, and thus
forming such a plan is simply rational behavior and should be done. It
turns out that there is a simple way to do this in Dyna-Q. The
exploration bonus of $\epsilon\sqrt{n_{xa}}$ is used not in the policy,
but in the update equation for the Q-values. That is, (4) is replaced
by:³

$$Q_{xa} \leftarrow Q_{xa} + \beta\,(r + \epsilon\sqrt{n_{xa}} + \gamma e(y) - Q_{xa}). \tag{5}$$

²The use of the square root is heuristic but not arbitrary, as the
standard deviation of all stationary, cumulative random processes
increases with the square root of the number of cumulating steps.

In addition, the system is permitted to hypothetically experience
actions it has never before tried, so that the exploration bonus for
trying them can be propagated back by relaxation planning. This can be
done by starting the system with a non-empty initial model. In the
experiments with Dyna-Q systems reported below, actions that had never
been tried were assumed to produce zero reward and leave the state
unchanged.
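
One hypothetical experience using update (5) might look like the following sketch. The function name `planning_step`, the dictionary model, and the single-state example are assumptions made for illustration; the default for untried pairs (zero reward, state unchanged) and the parameter values follow the text.

```python
import math
import random
from collections import defaultdict


def planning_step(Q, model, n, states, actions, rng,
                  beta=0.5, gamma=0.9, epsilon=0.001):
    """One hypothetical experience with the exploration bonus of (5)."""
    x = rng.choice(states)
    a = rng.choice(actions)                 # uniform random action (Step 3)
    # Pairs never tried in reality: zero reward, state left unchanged.
    y, r = model.get((x, a), (x, 0.0))
    e_y = max(Q[(y, b)] for b in actions)   # e(y) = max_b Q_yb
    bonus = epsilon * math.sqrt(n[(x, a)])
    Q[(x, a)] += beta * (r + bonus + gamma * e_y - Q[(x, a)])
    return x, a


Q = defaultdict(float)
n = defaultdict(int)
n[("s", "UP")] = 100                        # untested for 100 time steps
pair = planning_step(Q, {}, n, ["s"], ["UP"], random.Random(0))
```

Because the bonus enters the Q-value rather than the action choice, later backups can carry it to predecessor states, which is how a plan to revisit a distant, long-untested pair can form.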

6  Changing-World Experiments

Experiments were performed to test the ability of Dyna systems to solve
blocking and shortcut problems. Three Dyna systems were used: the
Dyna-PI system presented earlier in the paper, a Dyna-Q system
including the exploration bonus (5), called the Dyna-Q+ system,⁴ and a
Dyna-Q system without the exploration bonus (4), called the Dyna-Q-
system. All systems used $k = 10$. For the Dyna-PI system, the other
parameters were set as in the navigation experiment. For the Dyna-Q
systems, they were set at $\beta = 0.5$, $\gamma = 0.9$, and
$\epsilon = 0.001$.

The blocking experiment used the two mazes shown in the upper portion
of Figure 5. Initially a short path from start to goal was available
(first maze). After 1000 time steps, by which time the short path was
usually well learned, that path was blocked and a longer path was
opened (second maze). Performance under the new condition was measured
for 2000 time steps. Average results over 50 runs are shown in Figure 5
for the three Dyna systems. The graph shows a cumulative record of the
number of rewards received by the system up to each moment in time. In
the first 1000 time steps, all three Dyna systems found a short route
to the goal, though the Dyna-Q+ system did so significantly faster than
the other two. After the short path was blocked at 1000 steps, the
graph for the Dyna-PI system remains almost flat, indicating that it
was unable to obtain further rewards. The Dyna-Q systems, on the other
hand, clearly solved the blocking problem, reliably finding the
alternate path after about 800 time steps.

³Note that this differs from (4) only on hypothetical experiences, as
$n_{xa} = 0$ on real experiences.

⁴In these experiments, the Dyna-Q+ system selected the action $a$ in
each state $x$ that maximized $Q_{xa} + \epsilon\sqrt{n_{xa}}$, but we
have since found that equally good performance can be obtained simply
by picking the action with maximal $Q_{xa}$.


222  Sutton 




Figure  5.  Average  Performance  of  Dyna  Systems  on  a 
Blocking  Task.  The  left  maze  was  used  for  the  first 
1000  time  steps,  the  right  maze  for  the  last  2000. 
Shown  is  the  cumulative  reward  received  by  a  Dyna 
system  at  each  time  (e.g.,  a  flat  period  is  a  period 
during  which  no  reward  was  received). 


Figure  6.  Average  Performance  of  Dyna  Systems  on  a 
Shortcut  Task.  The  left  maze  was  used  for  the  first 
3000  time  steps,  the  right  maze  for  the  last  3000. 
Shown  is  the  cumulative  reward  received  by  a  Dyna 
system  at  each  time  (e.g.,  the  slope  corresponds  to  the 
rate  at  which  reward  was  received). 


The shortcut experiment began with only a long path available (first
maze of Figure 6). After 3000 time steps all three Dyna systems had
learned the long path, and then a shortcut was opened without
interfering with the long path (second maze of Figure 6). The lower
part of Figure 6 shows the results. The increase in the slope of the
curve for the Dyna-Q+ system, while the others remain constant,
indicates that it alone was able to find the shortcut. The Dyna-Q+
system also learned the original long route faster than the Dyna-Q-
system, which in turn learned it faster than the Dyna-PI system.
However, the ability of the Dyna-Q+ system to find shortcuts does not
come totally for free. Continually re-exploring the world means
occasionally making suboptimal actions. If one looks closely at
Figure 6, one can see that the Dyna-Q+ system actually achieves a
slightly lower rate of reward. In an unchanging environment, Dyna-Q+
will eventually perform worse than Dyna-Q-, whereas, in a changing
environment, it will be far superior, as here. One possibility is to
use a meta-level learning process to adjust the exploration parameter
$\epsilon$ to match the degree of variability of the environment.


One strength of the Dyna approach is that it applies to stochastic
problems as well as deterministic ones. We have explored this direction
in recent work, but are not yet ready to present systematic results.
The basic idea is to learn a model which predicts not a deterministic
next state and next reward, but rather a probability distribution over
next states and next rewards. In the simple cases we have explored,
this reduces to counting the number of times each possible outcome has
occurred. In hypothetical experiences, the expected value on the right
of (3) is then estimated using the sample statistics. A slightly
different exploration bonus is also needed. Promising preliminary
results have so far been obtained for simple problems involving random
autonomous agents and stochastic state transitions (e.g., action UP
takes the system to the state above 80% of the time, and to a random
neighboring state 20% of the time).
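
The counting model sketched in the paragraph above could take the following form. The class `CountModel`, its method names, and the example outcomes are assumptions for illustration; no systematic results for this variant are reported here, and the modified exploration bonus is not shown.

```python
from collections import Counter, defaultdict


class CountModel:
    """Stochastic world model kept as outcome counts per state-action pair."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # (x, a) -> Counter of (y, r)

    def observe(self, x, a, y, r):
        """Record one real experience."""
        self.counts[(x, a)][(y, r)] += 1

    def distribution(self, x, a):
        """Empirical probability of each (next_state, reward) outcome."""
        outcomes = self.counts[(x, a)]
        total = sum(outcomes.values())
        return {o: c / total for o, c in outcomes.items()}

    def expected_backup(self, x, a, e, gamma=0.9):
        """Sample estimate of E{r + gamma * e(y) | x, a}, the right of (3)."""
        return sum(p * (r + gamma * e[y])
                   for (y, r), p in self.distribution(x, a).items())


m = CountModel()
for _ in range(4):
    m.observe("s", "UP", "above", 0.0)      # UP usually reaches the state above
m.observe("s", "UP", "left", 0.0)           # but sometimes slips sideways
backup = m.expected_backup("s", "UP", {"above": 1.0, "left": 0.0})
```

In the deterministic case each pair has a single recorded outcome and the estimate reduces to the table lookup used in the experiments above.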

Further results are needed for a thorough comparison of Dyna-PI and
Dyna-Q architectures, but the results presented here suggest that it is
easier to adapt Dyna-Q architectures to changing environments.


7  Limitations and Conclusions

The simple illustrations presented here are clearly limited in many
ways. The state and action spaces are small and denumerable, permitting
tables to be used for all learning processes and making it feasible for
the entire state space to be explicitly explored. For large state
spaces it is not practical to use tables or to visit all states;
instead one must represent a limited amount of experience compactly and
generalize from it. Both Dyna architectures are fully compatible with
the use of a wide range of learning methods for doing this.

We have also assumed that the Dyna systems have explicit
knowledge of the world's state. In general, states cannot be
known directly, but must be estimated from the pattern of past
interaction with the world (Rivest & Schapire, 1987; Mozer &
Bachrach, 1990). Dyna architectures can use state estimates
constructed in any way, but will of course be limited by their
quality and resolution. A promising area for future work is
the combination of Dyna architectures with egocentric or
"indexical-functional" state representations (Agre & Chapman,
1987; Whitehead, 1989).

Yet another limitation of the Dyna systems presented here is
the trivial form of search control used. Search control in
Dyna boils down to the decision of whether to consider
hypothetical or real experiences, and of picking the order in
which to consider hypothetical experiences. The tasks
considered here are so small that search control is
unimportant, and thus it was done trivially, but a wide variety
of more sophisticated methods could be used. Particularly
interesting is the possibility of using Dyna architectures at
higher levels to make these decisions.

Finally,  the  examples  presented  here  are  limited 
in  that  reward  is  only  non-zero  upon  termination  of 
a  path  from  start  to  goal.  This  makes  the  problem 
more  like  the  kind  of  search  problem  typically  studied 
in  AI,  but  does  not  show  the  full  generality  of  the 
framework,  in  which  rewards  may  be  received  on  any 
step  and  there  need  not  even  exist  start  or  termination 
states. 

Despite these limitations, the results presented here are
significant. They show that the use of an internal model can
dramatically speed trial-and-error learning processes even on
simple problems. Moreover, they show how planning can be done
with the incomplete, changing, and ofttimes incorrect world
models that are constructed through learning. Finally, they
show how the functionality of planning can be obtained in a
completely incremental manner, and how a planning process can
be freely intermixed with reaction and learning processes. I
conclude that it is not necessary to choose between planning
systems, reactive systems and learning systems. These three
can be integrated not only into one system, but into a single
algorithm, where each appears as a different facet or different
use of that algorithm.

ABSTRACT

Most current learning and planning systems have been designed
to function well in an environment which is a model of the real
world. No model of the world can be perfect, however. For a
system to actually be able to learn and plan with the real
world it must be able to detect problems encountered while
acting on the world and to reconcile its model with sensed
data. We have constructed an explanation-based learning system
called GRASPER which has capabilities to monitor execution of
its plans and to tune its model of the world through use of
explicit approximations.

This paper first characterizes the different kinds of
approximations and introduces the use of approximate rules for
the purpose of learning uncertainty tolerant plans. Uncertainty
tolerant plans offer the important advantage that they can
function in spite of errors rather than imposing censors which
restrict the generality of a plan. The key issue with
uncertainty tolerance approximations is the ability to tune
them when the system encounters failures.

Our approximations for uncertainty tolerance involve tunable
continuous quantities. A new general algorithm is presented
which, by creating qualitative representations for the
quantitative behavior of an explanation-based rule, can
generate explanations as to how to increase the probability of
success of the failing expectation through tuning various
approximate quantities. A real-world example is given
illustrating the tuning process for one of the more common
failures occurring with GRASPER operating in the robotics
grasping domain. An empirical comparison of failure rates for
tuning and non-tuning runs is provided in the task of grasping
all pieces to a children's puzzle.

1.  Introduction 

Most learning and planning systems to date have functioned in
isolation from the real world. They work with a simplified
representation for the world, usually measuring success by the
ability to efficiently produce plans which function well under
the assumptions of that representation. This work has produced
many invaluable techniques for use in learning and planning.
However, one component missing from most of these systems is a
mechanism for reconciling their performance in the simplified
world with the real world. Any model of real-world behavior
will necessarily have discrepancies with how the real world
actually behaves. Figure 1 shows the common problem with such
systems: an unrealized discrepancy between their model and the
real world.

Figure 1.  Discrepancies Between a Model and the Real World

Explanation-based learning has shown promise in robotics. In
Segre's ARMS system, knowledge about the control of a robotic
manipulator in conjunction with geometric object knowledge
permitted learning assembly plans from observation [Segre87].
Explanation-based learning permits general plans to be learned
from a single example [DeJong86, Mitchell86]. In robotics, a
sequence of robot control primitives is used as the example.
Attempting to use ARMS to control a real robotic manipulator
often resulted in failures due to discrepancies with the real
world. This highlighted the need for a mechanism which can
adapt a system's model of the world to the real-world
environment.

We are currently developing a system called GRASPER to
illustrate the use of explanation-based learning in the real
world [Bennett89b]. It seeks to learn and execute real-world
plans tractably in the presence of uncertainty. Learned plans
are therefore error tolerant. Most existing systems which
engage in learning from failure impose restrictions on plans
when failures are encountered. Recent examples include
Hammond's CHEF [Hammond86] and Chien's work on plan refinement
[Chien89]. This ensures that the plans won't fail in similar
circumstances again but can decrease their overall generality.
Plan generality can be retained if plans are learned which
function in spite of errors. This is especially important in
real-world environments where uncertainty-related errors are
so pervasive.

GRASPER is being tested in a robotic grasping domain where it
can act on the world through control of a robotic manipulator
and can acquire data with a visual system and position and
force sensors associated with the manipulator. We are testing
system performance at grasping relatively flat pieces from a
puzzle for young children. Grasping complex shapes such as
these is a difficult ongoing robotics research area and
provides an ideal testbed for our learning algorithms. The
system employs explicit approximations in the domain theory. It
is the tuning of these approximations with experience which is
the key to the system. This paper focuses on how approximate
rules are tuned over time to increase the uncertainty tolerance
of learned plans. First, our approximation terminology is
introduced. The approximation tuning algorithm is presented
next. The algorithm is then employed on a robotics grasping
example. Lastly, we present experimental results comparing
tuning algorithm error rates to non-tuning error rates in
executing plans on a real-world robotic manipulator.

2.  Approximations 

An  approximation  has  two  important  defining  features: 

Assumability 

An approximation must make some statement about the world
based not on logical proof but on conjecture.

Tunability

The approximation must provide a method by which it can be
tuned as the system acquires new knowledge and/or its goals
change. The tuning method should allow adjustment of
continuous parameters of the approximation to decrease its
error with respect to the true world situation. The tuning
method may accept as input new knowledge (obtained from sensor
readings) which facilitates this adjustment.

Assumability gives an approximation its efficiency and
tractability advantage. This provides a justification that
further reasoning need not be done. Tunability indicates that
our use of the term approximation requires that approximations
be functions of continuously tunable parameters. By making
approximations explicit, rather than implicit as in many AI
systems, failures resulting from the approximations can be
traced back to inadequate approximations. In much of the work
on approximation, the focus is on how to make the
approximations initially. This is an important issue in using
approximations to improve the efficiency of one's knowledge.
Approximations differ from single binary assumptions in that
they embody a notion of state and are tuned from their current
state, not simply retracted if they lead to a failure as with
Chien's approach [Chien89]. Since approximations can be tuned
continuously, they also differ from the discrete number of
possible adjustments available in fixed abstraction hierarchies
like that of Doyle [Doyle86]. But in using approximations to
deal with real-world uncertainties, the goal is to improve the
uncertainty tolerance of one's knowledge. The unfortunate
reality is that everything a system senses from the outside
world is in some sense approximate already. To put this in
perspective, in the following two subsections, we discuss and
distinguish between data approximations and rule
approximations.

2.1.  Data  Approximations 

Data approximations involve representations for measures
sensed from the real world. The raw sensor readings obtained by
the system are external approximations. That is, they can only
be tuned through interaction with the world. If a range sensor
returned the distance to an object, that distance would be
externally approximate. However, one could imagine performing
some further actions in the world, such as using tactile
sensors to feel first contact with the object, so as to tune
that initial approximation for distance to the object. In
contrast, an internal approximation can be tuned by the
system's own reasoning alone without acting on the world. A
common type of internal data approximation employed by our
system is to reduce the complexity of visually sensed 2-D
object contours. Such contours involve hundreds of points and
make it difficult and slow to devise grasping points. The
internal data approximation currently employed reduces the
contour to an n-gon with n much less than the number of sensed
contour points. Data approximations, both external and
internal, are currently pre-selected for the domain. The tuning
of such approximations from their initial values is the key
problem here. Our use of internal data approximations is for
efficiency, not uncertainty tolerance.

2.2.  Rule  Approximations 

Rule approximations are employed when the system plans or
understands how a goal can be achieved. They affect the way the
system interacts with the world. Consequently, these
approximations are useful for building uncertainty tolerance
into a plan. They are always internal approximations, capable
of being controlled by the system. Approximation techniques,
such as those in use by Keller [Keller87], Zweben [Zweben88],
and others, which drop rule preconditions, are like our rule
approximations but in a discrete, not a continuous, sense.
Their approach to improving efficiency of rules through
approximations is complementary to ours as both efficiency and
uncertainty tolerance are important aspects of a system's
real-world performance.¹

This paper focuses on rule approximations for the purpose of
improving the uncertainty tolerance of a plan. Here, the rule
approximations involve a choice for continuous parameters whose
adaptability to the environment is desired. The initial
approximate rules include a declaration of these continuous
approximate quantities as well as a set of predicates
(antecedents to the approximate rules) which calculate the


1.  For a model of operationality for real-world systems which
brings together efficiency, uncertainty tolerance and other
factors see [Bennett89a].


228  Bennett 


initial values. These approximate rules are pre-defined as
part of the domain knowledge.

We have identified three basic types of rule approximations
and employ them all in our current implementation:

controls

These rules directly choose the value for some real-world
quantity. For example, in the robotic grasping domain,
chosen-opening-width is a control approximation which chooses
the amount by which the gripper should open for achieving the
grasp.

constraints

A constraint rule is used for restricting a choice from among a
set of candidates. Each constraint rule functions as part of a
multi-constraint rating rule for evaluating a set of candidate
choices. In the robotic grasping domain, approximate constraint
rules are used in choosing the best faces to use in achieving a
grasp. Currently implemented constraint approximations include
opening-width-constraint and contact-angle-constraint for
constraining the choice of faces to those with a realizable
opening width and a contact angle within the friction angle.

weights

Constraints which are part of one rating rule are combined
using weights which are themselves tunable approximations
represented by approximate rules. For example,
opening-width-constraint has an associated weight
opening-width-constraint-weight used in evaluating it relative
to contact-angle-constraint.
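
To make the interplay of constraints and weights concrete, the following sketch rates candidate face pairs. The text does not state how the weighted constraints are combined, so a weighted sum is assumed here; the numeric limits, the dictionary fields, and the function names are likewise hypothetical and chosen only for illustration.

```python
# Hypothetical limits; the real values would come from the domain theory.
MAX_OPENING = 80.0      # largest realizable gripper opening (assumed units)
FRICTION_ANGLE = 15.0   # largest acceptable contact angle in degrees (assumed)


def opening_width_constraint(pair):
    """Positive rating if the face pair needs a realizable opening width."""
    return 1.0 if pair["width"] <= MAX_OPENING else -1.0


def contact_angle_constraint(pair):
    """Positive rating if the contact angle lies within the friction angle."""
    return 1.0 if abs(pair["angle"]) <= FRICTION_ANGLE else -1.0


def rate_candidate(pair, constraints, weights):
    """Combine per-constraint ratings with a weighted sum (assumed rule)."""
    return sum(weights[name] * rate(pair) for name, rate in constraints.items())


def best_candidate(pairs, constraints, weights):
    """Pick the highest-rated face pair."""
    return max(pairs, key=lambda p: rate_candidate(p, constraints, weights))


constraints = {
    "opening-width-constraint": opening_width_constraint,
    "contact-angle-constraint": contact_angle_constraint,
}
weights = {"opening-width-constraint": 0.5, "contact-angle-constraint": 0.5}
```

Raising one weight relative to the other shifts which constraint dominates the choice, which is the effect that tuning a weight approximation is meant to achieve.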

Next, we present an algorithm for recognizing which
approximations, among the declared rule approximations, to
tune on failure and how to tune them.

3.  A  General  Tuning  Algorithm 

Given a goal, the system constructs an explanation for how the
goal may be achieved. This can be accomplished in either an
understanding mode, given an applied operator sequence, or in a
planning mode, where the operator sequence is derived. Rules
involved in constructing the explanation include approximate
rules as outlined above. Most of the facts employed in
constructing the explanation are data-approximate, having been
derived from sensed values from the real world. In order for a
monitored action to be achieved in the explanation, a set of
expected sensor values must be justified by a further subpart
of the explanation. The overall explanation is then generalized
using EGGS [Mooney86] and packaged into a rule as with standard
EBL systems. When the rule is executed in the real world, those
sensors described in the monitored actions are observed. If the
sensor readings observed violate the constraints described in
the monitored actions, plan execution has failed.² In this
case, the subpart of the original explanation which justified
the expected sensor readings is suspect. Clearly, in the
approximate model of the world, no error was foreseen,
otherwise the explanation would not have been possible. This
suspect subpart of the original explanation is the starting
point for our general tuning algorithm.

2.  Approximation tuning in our system is driven based on
expectation failures. This idea has long been advocated as a
trigger for learning. See [Schank82].

The  tuning  algorithm  has  two  major  steps: 

1)  generate a qualitative explanation for how the probability
of the failed expectation can be increased through tuning of
rule-approximate quantities and

2)  perform the actual tuning of the indicated rule
approximations.


The key in accomplishing the first step is to express the
relationships between generalized variables in the failing
sub-tree as qualitative relations. This will make possible
qualitative proofs which relate data-approximate quantities,
rule-approximate quantities, and qualitative probabilities of
success of the various predicates. The procedure is depicted
graphically in Figure 2. First, the sub-tree of the overall
explanation which supports the failed expectations is
instantiated with the generalized binding list which EGGS
produced for the original goal explanation. The predicates at
the root and leaves of this sub-tree are asserted in a new
situation as qualitative relations. The quantity arguments to
these predicates (which are generalized variables) become
quantities in our qualitative model of the sub-proof. Any
data-approximate or rule-approximate quantities which took part
in the original explanation and whose quantity variables are
members of the set of generalized quantity variables for the
sub-proof are asserted as data-approximate and tunable
quantities respectively in the current situation. Once these
facts have been asserted which pertain to the specific proof
tree, the goal of increasing the probability of the root
predicate to the sub-proof can be proved using a set of
domain-independent qualitative rules.

Figure 2.  Generating the Qualitative Model
[Figure labels: primary goal; explanation in generalized form;
sub-explanation supporting action whose expectations were
violated; assert root and leaves of failed sub-tree as
qualitative relations, e.g. (pred arg1 arg2 arg3 ...);
quantities named same as generalized variables.]

There are four classes of domain-independent qualitative rules
used by the system for generating the qualitative tuning
explanation:

general qualitative inference rules

These are rules for inferring the effects of changes in
quantities. For instance, the qualitative proportionality
predicate (Q+ ?a ?b) asserts that the magnitude of the quantity
?b directly influences the magnitude of the quantity ?a.
Therefore, one such inference rule states that to achieve the
goal of increasing ?a one could find a quantity ?b for which
(Q+ ?a ?b) holds and try to achieve the goal of increasing ?b.
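
The inference rule just described amounts to backward chaining over qualitative proportionalities. The sketch below is an illustration only and is written in Python rather than the system's Lisp rule language; the function name, the tuple encoding of (Q+ ?a ?b), and the quantity names in the usage note are all hypothetical.

```python
def ways_to_increase(goal, q_plus, tunable, seen=None):
    """Return the tunable quantities whose increase would increase `goal`.

    q_plus is a collection of (a, b) pairs encoding (Q+ a b): the magnitude
    of b directly influences the magnitude of a. The `seen` set guards
    against cycles in the proportionality graph.
    """
    seen = set() if seen is None else seen
    if goal in seen:
        return set()
    seen.add(goal)
    found = {goal} if goal in tunable else set()
    for a, b in q_plus:
        if a == goal:
            # Sub-goal: increase b in order to increase a.
            found |= ways_to_increase(b, q_plus, tunable, seen)
    return found
```

For example, with relations `[("r", "q1"), ("q1", "width")]` and `tunable = {"width"}`, the goal of increasing `r` is reduced to the goal of increasing `width`.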


qualitative predicate definitions

These rules provide qualitative representations for the
quantitative predicates employed in generating explanations.
For example, the predicate (dif ?q1 ?q2 ?r) is used by the
system for taking the difference between two quantities (?q1
and ?q2) and computing the result (?r). One of several rules
which form the qualitative predicate definition for the dif
predicate is shown below.

(rule :form
      (Q+ ?r ?q1)
      :ants
      (qrelation (dif ?q1 ?q2 ?r)))

It states that the magnitude of the quantity ?q1 directly
influences the magnitude of the result ?r in a dif predicate.
These definitions and the general qualitative inference rules
described above are similar to elements of Forbus' Qualitative
Process Theory [Forbus84].

approximation definition rules

Approximate quantities have properties which can be expressed
in a qualitative manner. These will be discussed in more detail
below.

qualitative probability rules

These rules relate the probabilities of success of predicates
in a way similar to the general qualitative inference rules.
Proportionalities can be declared between the probabilities of
success of certain pairs of predicates as well as between the
probability of success of a predicate and the magnitude of a
quantity. Using these proportionalities, goals of achieving
increases or decreases in probabilities of success can be
achieved. For example, the probability of success of the
antecedent to a rule is declared to have a positive influence
on the probability of success of the consequent of a rule.

In order to see how the qualitative tuning explanation is
constructed using these rules, it is important to understand
how qualitative probabilities of success are related to tunable
quantities. Quantitative predicates employed by the system have
one of two basic intents. Either they are calculation
predicates, whose purpose is to compute some value (e.g. the
dif function discussed earlier), or they are test predicates
(e.g. the less-than function). There is no way to vary the
probability of success of a calculation predicate since they
always succeed. A test predicate's probability of success,
however, is sensitive to the probability distribution of its
argument quantities. In the diagram below, the less-than test
on the right has a higher probability of succeeding given the
illustrated probability distributions for its arguments.

[Diagram: probability distributions of the arguments a and b
for two (< a b) tests.]

While probability distributions are difficult to define and
work with, there is a much simpler qualitative view of the
probability distribution: probability density decreases as one
moves either higher or lower away from the central value.

[Diagram: probability density decreasing on either side of the
central value.]

The general definition for a data approximation embodies this
principle. The measured quantity is taken to be the central
value. Some distribution is present because of the uncertainty
involved. Without knowing any details about the distribution,
the definition for a data approximation states that the
probability of encountering the actual value for the measured
quantity decreases as we get farther from the measured
approximate value. One of the approximation definition rules
regarding data approximate quantities is shown below.

(rule :form
      (PQ- (< ?test ?loc) ?test)
      :ants
      (data-approx-quantity ?loc2)
      (IQ+ ?loc ?loc2))

This translates to: if a less-than is being performed between
?test and a quantity ?loc which is indirectly or directly
qualitatively proportional to a data approximate quantity, the
probability of the less-than succeeding is inversely
proportional to the magnitude of the ?test quantity.

Rules like this effectively translate goals to increase the
probability of success of a predicate into goals to increase or
decrease quantities.
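
A small numeric illustration may help show why the probability of a less-than test falls as the tested quantity grows. The system itself reasons only qualitatively and assumes no particular distribution; purely for illustration, the sketch below assumes the actual value is normally distributed about the measured value. The function name and the parameters `measured` and `sigma` are hypothetical.

```python
import math


def prob_less_than(test, measured, sigma):
    """P(test < actual) when actual ~ Normal(measured, sigma).

    The Gaussian form is an assumption made only for this illustration.
    """
    z = (test - measured) / sigma
    # 1 - Phi(z), written with the error function.
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


# Probability of the test succeeding for increasing values of the test quantity.
probs = [prob_less_than(t, measured=10.0, sigma=1.0) for t in (8.0, 10.0, 12.0)]
```

The values in `probs` decrease monotonically, and the test succeeds exactly half the time when the tested quantity equals the measured value, which is the qualitative behavior the rule encodes.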

In general, an explanation for positively influencing the
probability of a predicate proceeds so as to:

1)  relate the probability of the failing predicate to that of
a test predicate involving approximate quantities

2)  use the definition of a data approximation to relate the
probability of success of a test predicate with the magnitude
of a tunable quantity

To guarantee that the probability of the failing predicate will
increase, all the test predicates in the rule antecedents must
be examined. At least one must show an increasing probability
of success and the others must be non-decreasing.
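
The guarantee condition stated above can be written as a one-line check. This is a sketch under the assumption that each test predicate's qualitative change in success probability has already been derived and is encoded as '+', '0', or '-'; that encoding and the function name are hypothetical.

```python
def guarantees_increase(changes):
    """True if at least one test predicate's success probability increases
    ('+') and none decreases ('-'); '0' marks an unchanged probability."""
    return any(c == "+" for c in changes) and all(c != "-" for c in changes)
```

A proposed tuning whose effects are `["+", "0"]` is acceptable, while `["+", "-"]` is not, since the decrease might outweigh the increase.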

The tuning explanation, once generated, indicates only which
quantities to tune and in which direction. The remaining task
is to carry out the tuning of the rule approximations. Figure 3
illustrates three possible states during the tuning of a rule
approximate quantity. Bounds and increasing or decreasing
constraints all have general predicates associated with them
which allow a specific numeric value for the bound to be
derived in the current context. The ordering of the constraints
is required to be the same across multiple contexts in which
the rule approximation is used. Values to the left of a lower
bound or to the right of an upper bound are not supported by
explanations in the approximate model of the world. Tuning
takes place between these bounds through the posting of
increasing and decreasing constraints. When new constraints are
posted to a rule approximation a function called the quality
function is re-calculated. The quality function provides an
estimate of which values and ranges of values best satisfy the
current constraints of the rule approximation. It is scaled to
between -1 and 1 where a negative value means the current set
of constraints is not met and a positive value indicates they
are. The magnitude indicates a relative rating of how well or
how poorly the constraints are met. For example, for a rule
approximation with bounds but no increasing or decreasing
constraints, the function is flat between the bounds indicating
there is no preference in choosing values in that region.
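
For the bounds-only case just described, a quality function might look like the sketch below. Only the sign convention, the [-1, 1] scaling, and the flatness between the bounds come from the text; the flat value of 1.0 and the way the penalty grows outside the bounds are assumptions, and the function name is hypothetical.

```python
def unconstrained_quality(value, lower, upper):
    """Quality in [-1, 1] for a rule approximation with bounds only.

    Assumes lower < upper.
    """
    if lower <= value <= upper:
        return 1.0  # flat: no preference among values inside the bounds
    distance = (lower - value) if value < lower else (value - upper)
    # Assumed penalty: grows with distance from the nearest bound, capped at -1.
    return -min(1.0, distance / (upper - lower))
```

Posting an increasing or decreasing constraint would replace the flat region with a sloped one; that shaping is shown in Figure 3 and is not modeled here.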

The different rule approximation types described earlier use
the quality function in different ways. A constraint
approximation uses the quality function to give a rating for
how well a value meets the constraints in the current context.
A control approximation, on the other hand, uses the quality
function in generating a set of approximate rules which control
the choice for its controllable quantity. In generating the
approximate rule set of a control approximation, preference is
given to values closer to the current value of the controllable
quantity. For the unconstrained case of Figure 3 three rules
are generated: one to prefer the lower bound when the current
value is less than the lower bound, one to keep the current
setting if it lies between the bounds, and one to use the upper
bound if the current value is greater than the upper bound.
When an expense is associated with movement of a control, this
minimizes the expenditure. In all other cases, for control
approximations, the rule sets the control to correspond with
the peak value of the quality function.
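
The three rules generated for the unconstrained case amount to moving the control the smallest distance that brings it within the bounds. The following sketch expresses them as a single function; the function name is hypothetical, and the system itself represents these as separate approximate rules rather than as code.

```python
def choose_control(current, lower, upper):
    """Select a control value for the unconstrained case."""
    if current < lower:
        return lower    # rule 1: prefer the lower bound
    if current > upper:
        return upper    # rule 3: use the upper bound
    return current      # rule 2: keep the current setting
```

Because the result is always the in-bounds value nearest to `current`, the movement of the control, and hence any associated expense, is minimized.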

Therefore, to carry out the tuning as prescribed by the
qualitative tuning explanation, a new constraint is posted to
the appropriate rule approximation for each suggested tuning.
If the value at which the failure was suggested was originally
generated from one of the constraints or bounds of the
approximation, the same general predicate expression is used
for calculating it but the type of constraint is changed as
necessary. When constraints need to be posted between sets of
existing constraints, a new general expression is created using
the general expressions for the two surrounding constraints and
using the ratio between their specific values in the context of
the failure. Once the new constraint has been added (or the old
constraint changed) the quality function is re-computed and,
for control approximations, the corresponding approximate rules
are replaced by a new set which reflects the new quality
function.

With constraint approximations, another decision must also be
made before tuning. When the qualitative tuning explanation
indicates that the tunable quantity related to a constraint
approximation is the target for tuning, it is possible that the
current constraint approximation had no say in the choice that
failed. This is because constraint approximations are combined
using approximate weights. If the quality function of the
current constraint approximation did give the selected value a
negative rating, the associated weight approximation should be
tuned instead of the constraint itself. This serves to increase
the relative importance of a constraint which is already tuned
correctly. Since weights are scaled in the range 0 to 1, this
amounts to either tuning the indicated weight to be increased
from its current value or, equivalently, if the indicated weight
is already set to 1, tuning all the weights for the other
constraints employed in the rating function to be decreased from
their current values.
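
The weight adjustment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tune_weight`, the dictionary representation of weights, and the fixed `step` size are all hypothetical and do not come from the GRASPER implementation, which the text says is written in Common Lisp.

```python
def tune_weight(weights, target, step=0.1):
    """Raise the relative importance of one constraint.

    weights: dict mapping constraint name -> weight in [0, 1]
    target:  name of the constraint whose weight should gain importance
    step:    assumed fixed adjustment size (not specified in the text)
    """
    tuned = dict(weights)  # leave the caller's dictionary untouched
    if tuned[target] < 1.0:
        # Normal case: increase the indicated weight, capped at 1.
        tuned[target] = min(1.0, tuned[target] + step)
    else:
        # The indicated weight is already 1, so lower all the others instead.
        for name in tuned:
            if name != target:
                tuned[name] = max(0.0, tuned[name] - step)
    return tuned


if __name__ == "__main__":
    print(tune_weight({"contact-angle": 0.5, "opening-width": 0.5}, "contact-angle"))
    print(tune_weight({"contact-angle": 1.0, "opening-width": 0.5}, "contact-angle"))
```

Both branches increase the target's share of the combined rating, which is the sense in which the text calls them equivalent.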


Figure 3. Three Possible States for Constraints on a Rule
Approximate Quantity (panels include "With Increasing
Constraint" and "With Opposing Constraints")


4.  An  Example  Illustrating  the  Algorithm 


Our testbed for learning and tuning approximate
explanation-based rules is the robotic grasping domain. The
system, GRASPER, is implemented in Lucid Common Lisp running on
an IBM RT125. The system acquires visual object contour data
using an MV1 frame grabber connected to an overhead mounted CCD
camera (see Figure 4). The system controls a UMI RTX SCARA-type
robotic manipulator equipped with variable force control and
joint encoders for all joints.

Figure 4. RTX Setup



Initially, the system uses the camera to acquire contour
information about objects in the workspace. These contours are
then approximated with n-gons (internal data approximations),
which result in (n^2 - n)/2 possible unique grasping face pairs.
In the runs performed here, n was set to 5. The data
approximated object representations as well as the current
information about the state of the robot manipulator are
asserted in the initial situation. The target object is then
selected and an explanation is generated for how to achieve a
grasp of the target. Figure 5 (automatically drawn by the
implementation) shows the selected target object with the
visually sensed contour drawn with a heavy line. The light
pentagon is the data approximation for the object. The arrows
indicate the positions of the leading edges of the fingers for
the grasp position given by the produced explanation. The
explanation for achieving grasp-object involves a total of about
300 nodes with a maximum depth of 10 levels. This explanation
was produced using the approximations in their initial state
before tuning. The approximate rule employed in the explanation
for deciding the gripper opening width to choose once the
grasping faces had been selected is shown in Figure 6. This rule
affected the separation of the arrows shown in Figure 5. After
the explanation was generated and its associated operator
sequence executed, the monitored action shown in Figure 7
encountered a violation of the expected sensor readings.

Figure 5. Grasp Target (labels: actual piece contour; data
approximation contour)
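
As a quick check of the face-pair count quoted above, the following minimal Python sketch enumerates the unordered pairs of faces of an n-gon. The function name `grasping_face_pairs` and the indexing of faces by integers are assumptions made for illustration only.

```python
from itertools import combinations


def grasping_face_pairs(n):
    """Return all unordered pairs of distinct faces of an n-gon.

    Faces are indexed 0..n-1 (an assumed representation).
    """
    return list(combinations(range(n), 2))


if __name__ == "__main__":
    n = 5  # the value used in the reported runs
    pairs = grasping_face_pairs(n)
    # The count agrees with the closed form (n^2 - n) / 2.
    print(len(pairs), (n * n - n) // 2)
```

For the pentagon approximation used in the runs (n = 5), this gives 10 candidate face pairs.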

INTRA-RULE: R190   ; one of three rules which are initially
                   ; defined by the opening width rule approximation
FORM:   (CHOSEN-OPENING-WIDTH ?GRIPPER ?X ?Y ?ANGLE ?OBJECT ?RETURN)
ANTS:   (GRIPPER-OPENING ?GRIPPER ?LOP187)
        (GRIPPER-PERP-WIDTH ?GRIPPER ?SPAN)
        (MIN-SPAN-FOR-OBJECT ?OBJECT ?X ?Y ?ANGLE ?SPAN ?LEFT ?RIGHT)
        (SUM ?LEFT ?RIGHT ?RETURN)
            ; find minimum required opening so fingers don't
            ; collide with object in approximate model
        (MAX-GRIPPER-OPENING ?GRIPPER ?MAX-OPEN)
        (<= ?RETURN ?MAX-OPEN)
            ; can't do it even in approximate model if too wide
            ; for gripper
        (< ?LOP187 ?RETURN)
APPROX: CHOSEN-OPENING-WIDTH   ; pointer to parent rule approximation

Figure 6. One of the Initial Approximate Rules For Opening Width


(MONITOR
  (MOVE-ZED ?GRIPPER10001 DOWN 20 64 20 POSITION)   ; move down
  (AND (POSITION ZED ?ZPOS15923)     ; force position to be recorded
       (FORCE ZED ?ZFOR15924)
       (< ?ZFOR15924 30))            ; all sensed forces on this joint
                                     ; must be less than 30 units
  (POSITION ZED ?LEVEL15925)         ; terminate when position is 0
                                     ; (at the table)
  NIL ?DOC15926
  (NO-GRIPPER-COLLISION-OBJECT ?GRIPPER10001 ?X16235 ?Y16236
       ?ANGLE16237 ?WIDTH16238 ?OBJECT16393))
                                     ; justification for sensor
                                     ; expectations; variables bound by
                                     ; antecedents to grasp rule

Figure 7. The Failing Monitored Action


The original explanation for the no-gripper-collision-object
goal indicated in the above monitored action is now suspect due
to the violated expectations. A sketch of the specific
explanation is shown in Figure 8. This explanation for why no
external force should have been sensed during the downward move
of the gripper is the starting point for developing the
qualitative tuning explanation. The generalized consequents and
antecedents of the no-gripper-collision subproof are asserted as
qualitative relations. Approximate quantities employed in the
subproof are identified and asserted as such. A proof is then
constructed for increasing the probability of success of the
no-gripper-collision-object goal. Figure 9 shows the qualitative
explanation for how opening the gripper (increasing the
opening-width tunable quantity) positively influences the
probability that there will be no collision between the first
gripper finger and the object. Table 1 gives the semantics for
the predicates employed in the explanation.

(NO-GRIPPER-COLLISION-OBJECT GRIPPER1 FINGER1 263 180 0 40 OBJECT148)
  (LEFT-FINGER-OF GRIPPER1 FINGER1)
  (NON-INTERSECTING-GRIPPER-FINGER-OBJECT
      GRIPPER1 FINGER1 263 180 0 40 OBJECT148)
    Subproof for translating finger to appropriate opening
      width (6 facts, 8 built-ins)
    Subproof for counter-rotating object center for clipping
      against finger (8 built-ins)
    Subproof for calculating extents and checking for overlap
      (7 built-ins)
  (RIGHT-FINGER-OF GRIPPER1 FINGER2)
  (NON-INTERSECTING-GRIPPER-FINGER-OBJECT
      GRIPPER1 FINGER2 263 180 0 40 OBJECT148)
    Subproof for translating finger to appropriate opening
      width (6 facts, 8 built-ins)
    SHARED Subproof for counter-rotating object center for
      clipping against finger (8 built-ins)
    Subproof for calculating extents and checking for overlap
      (7 built-ins)

Figure 8. Explanation Specific to Failure


(PS-INC NGC)
  (P+ NGC (<= MAX2575 MIN1572))
    (ANTECEDENT-OF NGC (<= MAX2575 MIN1572))
  (PS-INC (<= MAX2575 MIN1572))
    (PQ+ (<= MAX2575 MIN1572) MIN1572)
    (INCREASE MIN1572)
      (IQ+ MIN1572 WIDTH466)
      (TUNABLE WIDTH466)

Remaining nodes shown in the figure:
  (APPROX-QUANTITY OBX473)
  (IQ+ MAX2575 OBX473)
  (QRELATION (POSITION OBJECT467 OBX473 OBY474))
  (DATA-APPROXIMATION (POSITION OBJECT467 OBX473 OBY474) OBX473)

Annotations: (<= MAX2575 MIN1572) is the test predicate
associated with the finger1 collision test; all quantities are
named using variable names from the general rule; NGC
represents the failing predicate.

Figure 9. A Qualitative Tuning Explanation


Table 1. Predicates Employed in the Tuning Explanation of
Figure 9

(PS-INC ?pred)
    the probability of success of ?pred is influenced positively
(P+ ?pred1 ?pred2)
    the probability of success of ?pred2 influences the
    probability of success of ?pred1 positively
(ANTECEDENT-OF ?pred1 ?pred2)
    ?pred2 is an antecedent of ?pred1 in the rule being analyzed
(PQ+ ?pred ?quant)
    the magnitude of the quantity ?quant influences the
    probability of success of ?pred positively
(INCREASE ?quant)
    the magnitude of the quantity ?quant is influenced positively
(APPROX-QUANTITY ?quant)
    ?quant is an approximate quantity from a data approximation
    (not controllable by the system)
(IQ+ ?q1 ?q2)
    the magnitude of quantity ?q2 indirectly influences the
    magnitude of quantity ?q1 positively
(TUNABLE ?quant)
    the magnitude of quantity ?quant is a tunable quantity
    from a rule approximation (controllable by the system)

The topmost left-hand subtree establishes that the probability
of the <= test predicate can influence the probability of the
no-gripper-collision goal because it is an antecedent of the
rule. The right-hand subtree establishes that the probability of
the <= test can be positively influenced through an increase in
the opening width. The IQ+ predicate is a built-in predicate for
establishing transitive relations between quantities. It
consults the body of qualitative proportionalities which hold in
the current situation. There is a similar supporting explanation
for the other finger, which moves oppositely while opening.
Together, the two subproofs cover all the test predicates
employed in the original explanation and thus guarantee that
opening the gripper wider decreases the chance of this failure
(with the target object) happening in the future.
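
One way to picture the transitive reasoning attributed to IQ+ is as a search over a graph of signed qualitative proportionalities. The Python sketch below is an assumption-laden illustration, not the GRASPER built-in: the function names, the triple representation, and the intermediate quantity `FINGER-OFFSET` are hypothetical. It returns the sign of the first chain found and does not try to resolve quantities that are influenced both positively and negatively.

```python
from collections import deque


def indirect_influence(proportionalities, source, target):
    """Return +1 or -1 if `source` influences `target` through a chain of
    signed proportionalities, or None if no chain exists.

    proportionalities: iterable of (influencer, influenced, sign) triples,
    with sign +1 or -1 (an assumed encoding).
    """
    graph = {}
    for influencer, influenced, sign in proportionalities:
        graph.setdefault(influencer, []).append((influenced, sign))

    seen = {(source, +1)}
    queue = deque([(source, +1)])
    while queue:
        node, sign = queue.popleft()
        for nxt, edge_sign in graph.get(node, []):
            combined = sign * edge_sign  # signs multiply along a chain
            if nxt == target:
                return combined
            if (nxt, combined) not in seen:
                seen.add((nxt, combined))
                queue.append((nxt, combined))
    return None


def iq_plus(proportionalities, q1, q2):
    """(IQ+ ?q1 ?q2): ?q2 indirectly influences ?q1 positively."""
    return indirect_influence(proportionalities, q2, q1) == +1


if __name__ == "__main__":
    # Hypothetical proportionalities; only WIDTH466 and MIN1572 appear
    # in Figure 9, the intermediate quantity is invented for the example.
    props = [("WIDTH466", "FINGER-OFFSET", +1),
             ("FINGER-OFFSET", "MIN1572", +1)]
    print(iq_plus(props, "MIN1572", "WIDTH466"))
```

Note the argument order: following Table 1, the influenced quantity comes first and the influencing quantity second.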

The qualitative tuning explanation indicates that the
chosen-opening-width rule approximation should be tuned:
namely, that an increasing constraint be posted at the minimum
opening width, which was chosen in the failure. Figure 10
illustrates the chosen-opening-width rule approximation before
(top) and after (bottom) tuning has occurred. After tuning, the
rules associated with the rule approximation are redefined.
Afterwards, the single new rule associated with this
approximation reads:

INTRA-RULE: R30278
FORM:   (CHOSEN-OPENING-WIDTH ?GRIPPER ?X ?Y ?ANGLE ?OBJECT ?RETURN)
ANTS:   (GRIPPER-PERP-WIDTH ?GRIPPER ?SPAN)
        (DISTANCE-TO-CLOSEST-OBJECT ?OBJECT ?X ?Y ?ANGLE ?SPAN ?RADIUS)
        (GRIPPER-FINGER-PARALLEL-WIDTH ?GRIPPER ?PSPAN)
        (DIF ?RADIUS ?PSPAN ?NRADIUS)
        (MAX-GRIPPER-OPENING ?GRIPPER ?MAX-OPEN)
        (MIN ?NRADIUS ?MAX-OPEN ?RETURN)
        (MIN-SPAN-FOR-OBJECT ?OBJECT ?X ?Y ?ANGLE ?SPAN ?LEFT ?RIGHT)
        (SUM ?LEFT ?RIGHT ?MIN)
        (<= ?MIN ?RETURN)
APPROX: CHOSEN-OPENING-WIDTH

This rule prefers selection of the peak of the newly
re-calculated quality function, which corresponds to opening as
wide as the current situation permits.

Figure 10. The Chosen-Opening-Width Rule Approximation Before
and After Tuning. Top panel, Chosen-Opening-Width (Initial):
quality function over opening widths from the width of the
target object to min(distance to nearest object,
max-opening-width). Bottom panel, Chosen-Opening-Width (After
Tuning): the same range with the posted constraint "prefer
greater than target object".
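
The arithmetic carried by the antecedents of the tuned rule can be paraphrased in Python. This is a reading of the rule under assumptions, not a port of it: the function name is hypothetical, DIF is taken to be subtraction of its second argument from its first, and the geometric predicates (DISTANCE-TO-CLOSEST-OBJECT, MIN-SPAN-FOR-OBJECT and so on) are replaced by plain numeric inputs.

```python
def chosen_opening_width(radius, pspan, max_open, left, right):
    """Return the widest admissible opening, or None if none exists.

    radius:      distance to the closest other object (?RADIUS)
    pspan:       parallel width of a gripper finger (?PSPAN)
    max_open:    maximum gripper opening (?MAX-OPEN)
    left, right: the two parts of the minimum span for the object
    """
    nradius = radius - pspan            # (DIF ?RADIUS ?PSPAN ?NRADIUS), assumed a - b
    opening = min(nradius, max_open)    # (MIN ?NRADIUS ?MAX-OPEN ?RETURN)
    min_span = left + right             # (SUM ?LEFT ?RIGHT ?MIN)
    # (<= ?MIN ?RETURN): the opening must still surround the object.
    return opening if min_span <= opening else None


if __name__ == "__main__":
    # Isolated target: the gripper limit is the binding constraint.
    print(chosen_opening_width(radius=100, pspan=10, max_open=80, left=20, right=20))
    # Crowded workspace: a nearby object limits the opening instead.
    print(chosen_opening_width(radius=50, pspan=10, max_open=80, left=20, right=20))
```

Compared with the initial rule of Figure 6, which returns the minimum span itself, this version returns the largest width the situation permits and uses the minimum span only as a feasibility check.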

5.  Experimental  Results 

The GRASPER system was given the task of achieving equilibrium
grasps on the 12 smooth plastic pieces of a children's puzzle.
Figure 12 shows the gripper and several of the pieces employed
in these experiments. A random ordering and set of orientations
was selected for presentation of the pieces. Target pieces were
also placed in isolation from other objects. That is, the
workspace never had pieces near enough to the grasp target to
impinge on the decision made for grasping the target. The first
run was performed with approximation tuning turned off. The
results are illustrated in Figure 11. Failures observed during
this run included finger stubbing failures, where a gripper
finger struck the top of the object while moving down to
surround it, and lateral slipping failures, where, as the
grippers were closed, the object slipped out of grasp, sliding
along the table surface. In the system's initial approximate
representation for the world, the choice of grasping faces is
constrained only by the gripper being able to open wide enough
to surround them and by an equilibrium grasp being realizable
with the current gripper-object friction coefficient (initially
1 here). Since a friction coefficient of 1 is likely to be high
for these materials, the choice of contact faces is likely to be
under-constrained initially, resulting in slipping failures. The
choice of opening width is the minimum deviation from the
current opening width (initially 0 here) which satisfies the
approximate model of the grasp. Due to uncertainties in the
world, this approximate opening width may often result in
stubbing failures. Consequently, the error rate of our initial
approximate plan was high, resulting in 9 finger stubbing
failures and 1 lateral slipping failure in 12 trials.

In our second run, approximation tuning was turned on. An
initial stubbing failure on trial 1 led to a tuning of the
chosen-opening-width rule approximation, which determines how
far to open for the selected grasping faces. Since the generated
qualitative tuning explanation illustrated that opening wider
would decrease the chance of this type of failure, the system
tuned the approximate rule to choose the largest opening width
possible (constrained only by the maximum gripper opening and
possible collisions with nearby objects). In the case of
isolated grasp targets, opening to the maximum gripper width is
preferred. In trials 2 and 3, finger stubbing failures did not
occur as they had previously because the opening width was
greater than the object width for that orientation. Vertical
slipping failures, which the current implementation does not
have knowledge about, did occur. The system had to be told that
a vertical slipping failure had occurred instead of the lateral
slipping failure it thought had occurred. This is because,
without further knowledge about vertical slipping failures and a
means for detecting them, they look in other ways (the force vs.
position profile of the gripper closing) like a lateral slipping
failure. Preventing vertical slipping failures involves knowing
shape information along the height dimension of the object,
which we plan to give in the future using a model-based vision
approach. In trial 5, a lateral slipping failure is seen and the
qualitative tuning explanation suggests decreasing the contact
angle between selected grasping surfaces through tuning the
contact-angle-constraint approximation. This is tuned to prefer
smaller contact angles. A single tuning for the finger stubbing
and lateral slipping failures was sufficient to eliminate those
failures with isolated grasp targets.

Figure 11. Comparison of Tuning to Non-tuning in Grasping the
Pieces of a Puzzle (horizontal axis: trials 1 through 12; one
panel labeled "Trials With Tuning")

Figure 12. Gripper and Pieces

6.  Future  Work  &  Conclusions 


Further experiments are in progress with the system at this
time. One involves investigating approximation tuning in
environments where the target object is near other objects. This
type of situation will illustrate tradeoffs between constraints
on an approximation. For instance, because of the close
proximity of other objects, failures may occur because opening
to the maximum width (as was learned here) results in a
collision with a nearby object. We hope to demonstrate that the
approximations tune themselves appropriately for different
characteristics of the environment, such as density of objects.

We are also in the process of implementing a general mechanism
for rating the plausibility of different possible failures based
on the qualitative view of an explanation discussed here. By
isolating the cause of the failure more precisely within the
failed expectation explanation, through the plausibility rating
mechanism, qualitative tuning explanations are made easier to
generate.

Any system which hopes to manage a model of the world in
conjunction with actions and values sensed in the real world has
to have some approximation mechanism. The characterization of
data and rule approximations provides a good framework from
which to explore how to manage approximations. Tunable
approximate quantities are sufficiently powerful to introduce
uncertainty tolerance into plans in an on-demand manner.
Uncertainty tolerance allows plans to function in spite of the
type of data errors which lead to failures. The approximation
tuning method presented offers a general way of discovering the
relationships between the tunable approximate quantities, data
approximate quantities measured from the world, and ultimately
the probability of success of a goal. The experimental results
indicate that the method of approximation tuning decreases the
overall failure rates for real-world domains as compared with a
static approach which must continue to learn and explain within
its fixed declared representation of the world.


Abstract

Analytic learning techniques, such as explanation-based
learning (EBL), can be powerful methods for acquiring knowledge
about a domain where there is a pre-existing theory of the
domain. One application of EBL has been to learning apprentice
systems where the solution to a problem generated by a human is
used as input to the learning process. The learning system
analyzes the example and is then able to solve similar problems
without outside assistance. One limitation of EBL is that the
domain theory must be complete and correct. In this paper we
present a general technique for learning from outside guidance
that can correct and extend a domain theory. In contrast to
hybrid systems that use both analytic and empirical techniques,
our approach is completely analytic, using the chunking learning
mechanism in the Soar architecture. This technique is
demonstrated for a block manipulation task that uses real
blocks, a Puma robot arm and a camera vision system.

1  Introduction 

Learning  through  interacting  with  a  human  is  an  ef¬ 
ficient  method  to  increase  the  knowledge  of  an  intel¬ 
ligent  agent.  Initially,  an  intelligent  agent  may  have 
only  very  general  abilities  and  may  require  significant 
guidance  from  a  human  operator.  Through  its  ex¬ 
periences,  the  agent  can  become  more  and  more  au¬ 
tonomous,  making  most  decisions  on  its  own  and  re¬ 
quiring  guidance  only  for  novel  situations.  It  can  in¬ 
crease  its  repertoire  of  methods  for  solving  problems, 
improve its reaction time to events in the environment,
learn  to  notice  new  properties  of  objects  in  the  envi¬ 
ronment,  as  well  as  refine  and  extend  its  own  model  of 
the  domain. 

*This research was sponsored by grant NCC2-517 from
NASA  Ames  and  ONR  grant  N00014-88-K-0554. 


Many  approaches  are  possible  for  learning  from  ex¬ 
perience  and  outside  guidance.  In  the  simplest  case, 
the human solves a problem and the learning system
must  “watch  over  the  shoulder”  of  the  human  as  the 
problem  is  solved.  This  is  the  scheme  used  in  robotic 
programming  systems  where  a  human  leads  the  sys¬ 
tem  through  a  fixed  set  of  commands  to  achieve  a  goal. 
When  the  commands  are  stored,  the  system  can  per¬ 
form  only  that  one  task  and  there  is  no  conditionality 
in  the  learned  plan.  The  robot  will  execute  exactly  the 
learned  set  of  actions,  independent  of  the  state  of  the 
environment. 

To  avoid  these  problems,  “learning  apprentices” 
have  been  developed  that  create  generalized  plans,  in¬ 
dexed  by  the  appropriate  goal.  These  systems,  such 
as LEAP [Mitchell et al., 1985], are based on a learning
strategy called explanation-based learning (EBL)
[DeJong & Mooney, 1986; Mitchell et al., 1986]. In
a  learning  apprentice  system,  the  human  provides  a 
complete  solution  to  a  problem.  An  underlying  “do¬ 
main  theory”  is  then  used  to  “explain”  the  actions  of 
the  human  expert.  From  the  derived  explanation,  all 
of  the  dependencies  between  the  actions  are  recovered 
and  a  general  plan  is  created. 

Two  extensions  to  this  basic  model  of  external  guid¬ 
ance  have  been  previously  made  using  the  Soar  ar¬ 
chitecture  which  uses  an  analytic  learning  mechanism 
similar to EBL, called chunking [Golding et al., 1987;
Laird et al., 1987; Rosenbloom & Laird, 1986]:

1.  The  learning  through  guidance  is  integrated  with 
general  problem  solving  and  autonomous  learn¬ 
ing.  The  system  can  learn  from  its  own  experi¬ 
ences  with  or  without  outside  guidance. 

2.  The  system  actively  seeks  advice  while  solving  its 
own  problems  as  opposed  to  passively  monitoring 
a  human  problem  solver.  In  addition,  the  guid¬ 
ance  occurs  in  the  context  of  the  system  solving 
a  problem  and  the  guidance  is  at  the  level  of  in¬ 
dividual  decisions  instead  of  complete  plans. 

In this paper we will demonstrate that this technique
can be extended to learning new domain knowledge,
not just control knowledge. The tasks we will use
involve simple manipulations of blocks using a Puma
robot arm and a single camera machine vision system.
This is a simplification of the manipulator control
tasks performed by Segre's ARMS system [Segre,
1987];  however,  ARMS  worked  only  in  a  simulated  en¬ 
vironment.  With  the  real  robot,  the  domain  theory 
is  complicated  by  incomplete  and  time-dependent  per¬ 
ception;  the  camera  provides  only  two-dimensional  in¬ 
formation  and  is  sometimes  obscured  by  the  robot  arm, 
and  the  vision  processing  time  is  4  seconds.  The  goal 
of  the  task  is  to  line  up  a  set  of  blocks  that  have  been 
scattered  over  the  work  area.  In  one  task,  all  of  the 
blocks  are  simple  cubes  that  the  gripper  can  pick  up  in 
two  different  orientations.  In  the  second  task,  one  of 
the  blocks  is  a  triangular  prism.  The  gripper  is  unable 
to  pick  up  the  prism  when  it  closes  over  the  inclined 
sides.  Instead  it  must  be  oriented  so  that  it  closes  over 
the  vertical  faces  of  the  block. 

One  restriction  on  using  analytic  methods  such  as 
EBL  and  chunking  is  that  they  require  a  complete  and 
correct  domain  theory.  The  domain  theory  is  an  in¬ 
ternal  model  of  all  of  the  preconditions  and  effects  of 
the  operators  used  to  perform  the  tasks.  For  the  block 
manipulation  task,  the  operators  are  commands  to  the 
robot  controller  such  as  open  gripper,  close  gripper, 
and  withdraw  gripper  from  work  area.  If  the  internal 
model  of  these  operators  is  in  some  way  incorrect,  then 
the  control  knowledge  learned  through  guidance  will 
also  be  incorrect.  For  example,  if  the  domain  knowl¬ 
edge  is  not  sensitive  to  the  special  properties  of  a  prism 
block,  the  control  knowledge  it  learns  will  ignore  the 
orientation  of  a  prism  block  when  attempting  to  pick 
it  up.  We  will  show  how  it  is  possible  to  correct  con¬ 
trol  knowledge  using  outside  guidance  and  a  domain 
theory  for  determining  the  relevant  features  of  the  en¬ 
vironment  and  relating  them  to  robot  actions. 

Even  if  the  learned  control  knowledge  can  be  cor¬ 
rected,  the  original  domain  theory  may  still  be  in¬ 
correct.  We  show  that  our  system  can  correct  the 
original  domain  theory  by  replacing  incorrect  oper¬ 
ator  definitions  with  new,  corrected  definitions,  and 
that  it  can  extend  the  domain  theory  by  creating  com¬ 
pletely  new  operators.  In  both  cases,  no  modification 
is  made  to  the  underlying  architecture;  instead  knowl¬ 
edge  is  added  to  correct  and  create  operators.  In  the 
block  manipulation  task,  we  will  demonstrate  the  ac¬ 
quisition  of  planning  and  execution  knowledge  for  the 
rotate-gripper  operator.  These  extensions  are  based 
on  the  creation  of  an  underlying  domain  theory  for  the 
construction  of  new  operators.  In  the  limit,  a  complete 
domain theory can be acquired through outside guidance,
with only very limited predefined knowledge of
the  task. 

Section  2  presents  the  system  architecture.  Sec¬ 
tion  3  gives  an  example  of  learning  control  knowledge 
through  outside  guidance.  This  reproduces  the  results 
of  Golding,  Rosenbloom  and  Laird  (1987)  but  in  a  task 
requiring interaction with a real environment. In Section 4
we extend this work by demonstrating the correction
of control knowledge using outside guidance.*
Sections 5 and 6 present demonstrations of the correction
and extension of domain knowledge through guidance.
The final section discusses the contributions and
limitations of the current approach.

Figure 1: Robo-Soar system architecture.

2  System  Architecture 

The system we are developing is called Robo-Soar
[Laird et al., 1989].** Figure 1 shows its architecture.
Visual input comes from a camera mounted
directly above the work area. A separate computer
processes the images, providing asynchronous input
to Soar. The vision processing extracts the positions
and orientations of the blocks in the work area as well
as distinctive features of the blocks. The vision and
robotic systems are sufficiently accurate so that there
are no significant sensor or control errors for the block
manipulation tasks.

In Soar, a task is solved by selecting and applying
operators to transform an initial state to some desired
state. Operators are grouped into problem spaces and
Soar selects an appropriate problem space for each
goal. For the block manipulation task, the states are
different configurations of the blocks and gripper in the
external environment. Some basic operators are shown
in the trace of Robo-Soar solving a simple block
manipulation problem in Figure 2. These operators generate
motor commands for the Puma.

Figure 2: Moving a block using the primitive Robo-Soar operators.

*Some of the material in Sections 2, 3 and 4 has been
previously presented in [Laird et al., 1989].

**Robo-Soar is implemented in Soar 5, a new version
which supports interaction with external environments
[Laird et al., 1990].

One  complication  is  that  the  camera  is  mounted  di¬ 
rectly  above  the  work  area  so  that  the  arm  obscures 
the  view  of  a  block  that  is  being  picked  up.  Two  opera¬ 
tors,  snap-in  and  snap-out,  move  the  arm  in  and  out 
of  the  work  area  so  that  a  clear  image  can  be  obtained. 
These  operators  are  necessary,  but  for  simplicity,  they 
will not be included in any of the examples.

This  characterization  of  Robo-Soar  does  not  distin¬ 
guish  it  from  any  other  robot  controller.  What  is  dif¬ 
ferent  is  the  way  Soar  makes  the  decisions  to  select  an 
operator.  Many  AI  or  robotic  systems  create  a  plan 
that the robot must execute to select one operator after
another. Instead of creating a plan, Soar makes
each  decision  based  on  a  consideration  of  its  long-term 
knowledge,  its  perception  of  the  environment,  and  its 
own  internal  state  and  goals.  Soar’s  long-term  knowl¬ 
edge  is  represented  as  productions  that  are  continually 
matched  against  the  working  memory  which  includes 
all  input  from  sensors,  guidance  from  an  advisor,  and 
internal  data  structures  and  goals.  Commands  to  the 
robot  controller  are  issued  by  creating  data  structures 
in  working  memory. 

In  contrast  to  traditional  production  systems  such 
as  OPS5,  Soar  fires  all  successfully  matched  produc¬ 
tion  instantiations  in  parallel,  which  in  turn  elaborate 
the  current  situation,  or  create  preferences  for  the  next 
operator  to  be  taken.  Soar’s  language  of  preferences 
allows  productions  to  control  the  selection  of  opera¬ 
tors  by  asserting  that  operators  are  acceptable,  not 
acceptable,  better  than  other  operators,  as  good  as 


others,  and  so  on.  Production  firing  continues  until 
no additional productions match, at which point Soar
examines  the  preferences  and  selects  the  best  operator 
for  the  current  situation.  Once  an  operator  is  selected, 
productions  can  fire  to  apply  the  operator.  If  the  oper¬ 
ator  affects  the  external  environment,  the  productions 
create  commands  to  the  motor  system.  If  the  oper¬ 
ator  is  used  for  planning  or  internal  processing,  the 
productions  directly  modify  the  internal  data  struc¬ 
tures  in  working  memory.  Additional  productions  test 
for  operator  completion  and  signal  that  a  new  operator 
can  be  selected. 
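
The decision procedure described in the last two paragraphs can be sketched as follows. This is a heavily simplified illustration in Python, not Soar's actual implementation: productions are modelled as plain functions, working memory as a set of tuples, and only three preference kinds ("acceptable", "reject", "better") are handled. All names, including the example operators, are assumptions introduced for the sketch.

```python
def decide(productions, working_memory):
    """Elaborate to quiescence, then pick an operator or report an impasse.

    Each production takes a frozenset of working-memory tuples and returns a
    set of tuples to add. Preferences are tuples of the form
    ("acceptable", op), ("reject", op) or ("better", op_a, op_b).
    """
    wm = set(working_memory)
    while True:
        snapshot = frozenset(wm)       # every production sees the same state,
        additions = set()              # which imitates parallel firing
        for production in productions:
            additions |= production(snapshot)
        if additions <= wm:            # quiescence: nothing new was produced
            break
        wm |= additions

    acceptable = {e[1] for e in wm if e[0] == "acceptable"}
    rejected = {e[1] for e in wm if e[0] == "reject"}
    candidates = acceptable - rejected
    # An operator is dominated if another candidate is better than it.
    dominated = {e[2] for e in wm if e[0] == "better" and e[1] in candidates}
    remaining = sorted(candidates - dominated)
    if len(remaining) == 1:
        return ("operator", remaining[0])
    return ("impasse", remaining)      # no unique choice: a subgoal is needed


def propose(wm):
    """Hypothetical production: propose two operators when the gripper is open."""
    if ("gripper", "open") in wm:
        return {("acceptable", "approach"), ("acceptable", "close-gripper")}
    return set()


def prefer_approach(wm):
    """Hypothetical control knowledge: approaching beats closing the gripper."""
    if ("acceptable", "approach") in wm and ("acceptable", "close-gripper") in wm:
        return {("better", "approach", "close-gripper")}
    return set()


if __name__ == "__main__":
    start = {("gripper", "open")}
    print(decide([propose, prefer_approach], start))
    print(decide([propose], start))
```

With both productions present the preferences single out one operator; with the control production removed, the two proposals tie and the sketch reports an impasse, which corresponds to the situation in which Soar would generate a subgoal.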

In a familiar domain, Soar's knowledge may be adequate
to select and apply an operator without difficulty.
However, when the preferences do not determine
a unique choice, or when the productions are unable
to implement the selected operator, an impasse arises
and Soar automatically generates a subgoal. In the
subgoal, Soar uses the same approach; it selects and
applies  operators  from  an  appropriate  problem  space 
to  achieve  the  subgoal.  The  operators  in  the  subgoal 
can  modify  or  query  the  environment,  or  they  may  be 
completely  internal,  possibly  simulating  external  op¬ 
erator  applications  on  internal  representations. 

When  Soar  creates  results  in  its  subgoals,  it  learns 
productions,  called  chunks,  that  summarize  the  pro¬ 
cessing  that,  occurred.  Tiie  actions  of  a  cIiuii’k  aic 
based  on  the  results  of  the  subgoal.  The  conditions 
are  based  on  only  those  working-memory  elements  that 
were  tested  and  found  necessary  to  derive  the  results 
Thus,  knowledge  used  to  control  the  selection  of  oper¬ 
ators  in  the  subgoal  is  not  included  in  the  derivation 
because  it  affects  only  the  efficiency  of  producing  the 
results,  not  their  correctness. 
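
A rough sketch of this idea is given below. It is not Soar's chunking mechanism: the `TrackedMemory` class, the `test`/`peek` distinction, and the attribute names are hypothetical, and the conditions here are constants, whereas real chunks are generalized further. The sketch only shows that elements consulted for control do not become conditions.

```python
class TrackedMemory:
    """Working memory that records which elements a subgoal actually tests."""

    def __init__(self, elements):
        self._elements = dict(elements)
        self.tested = {}

    def test(self, attribute):
        value = self._elements[attribute]
        self.tested[attribute] = value  # becomes a chunk condition
        return value

    def peek(self, attribute):
        # Used for control decisions inside the subgoal: not recorded,
        # because it affects only efficiency, not the result.
        return self._elements[attribute]


def build_chunk(elements, subgoal):
    """Run `subgoal` and summarize it as (conditions, result)."""
    memory = TrackedMemory(elements)
    result = subgoal(memory)
    return dict(memory.tested), result


def chunk_applies(chunk, elements):
    conditions, _ = chunk
    return all(elements.get(k) == v for k, v in conditions.items())


def evaluate_approach(memory):
    memory.peek("advisor-hint")  # control knowledge only
    if (memory.test("gripper-holding") == "nothing"
            and memory.test("gripper-above") == "block"):
        return "best-preference-for-approach"
    return "no-preference"


wm = {"gripper-holding": "nothing", "gripper-above": "block",
      "advisor-hint": "approach", "block-name": "A"}
chunk = build_chunk(wm, evaluate_approach)
```

The resulting `chunk` mentions neither `advisor-hint` nor `block-name`, so it also applies to a different block in the same relational situation.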

Figure 3: Trace of problem solving using external guidance to suggest appropriate task operators for evaluation.
The  problem  solving  goes  left  to  right  with  squares  representing  states,  while  horizontal  arcs  represent  operator 
applications.  Downward  pointing  arcs  are  used  to  represent  the  creation  of  subgoals,  and  upward  pointing  arcs 
represent  the  termination  of  subgoals. 


3  Learning  Control  Knowledge 

To  attempt  the  block  manipulation  task,  knowledge 
about  the  operators  must  be  encoded  as  productions. 
This  knowledge  includes  productions  that  suggest  op¬ 
erators  whenever  they  can  be  applied  legally,  as  well 
as  productions  that  implement  the  operators  by  cre¬ 
ating  motor  commands  to  move  the  arm.  Because  the 
feedback  from  the  vision  system  is  incomplete  when 
the arm is being used to pick up a block, some productions
must also create internal expectations of the
position  of  blocks  until  feedback  is  received  when  the 
arm  is  moved  out  of  the  way. 

With  just  this  basic  knowledge,  Robo-Soar  can  at¬ 
tempt  a  task,  but  it  will  encounter  impasses  whenever 
it  tries  to  select  a  task  operator,  as  shown  in  Figure  3. 
To  resolve  these  tie  impasses,  the  tied  task  operators 
are  evaluated  in  a  subgoal  and  preferences  are  created 
to  pick  the  best  one.  The  evaluations  are  carried  out 
by  operators  created  in  the  subgoal  of  the  tie  impasse. 
Within  this  subgoal,  a  decision  must  be  made  as  to 
which  evaluation  operator  should  be  selected  first,  and 
thus,  which  task  operator  should  be  evaluated  first.  If 
the  task  operator  that  leads  to  the  goal  is  evaluated 
first,  the  other  task  operators  can  be  ignored  because 
the  ‘best’  operator  has  been  found.  As  with  the  orig¬ 
inal  decision  to  select  between  the  task  operators,  the 
decision  to  select  an  operator  to  be  evaluated  will  lead 
to  a  tie  impasse. 

Outside  guidance  can  be  used  to  select  the  best  op¬ 
erator  to  evaluate.  The  guidance  is  used  to  determine 
which operator to evaluate first, not which operator
to apply. The advantage of this approach is that the
guidance  acts  only  as  a  heuristic  that  is  verified  by 
internal  problem  solving.  The  problem  solving  calcu¬ 
lates  the  internal  evaluation  and  determines  whether 
the  chosen  operator  is  really  appropriate.  The  system 
can  then  learn  the  conditions  under  which  the  oper¬ 
ator  is  appropriate.  If  advice  directly  selected  a  task 
operator  that  led  to  motor  commands,  there  would  be 
no  internal  analysis  of  the  operator  that  could  be  used 
for  learning. 

To  acquire  the  guidance  within  the  subgoal,  Robo- 
Soar  uses  its  advise  problem  space  which  has  operators 
that  print  out  the  acceptable  task  operators,  ask  for 
guidance,  and  wait  for  a  response.  If  guidance  is  avail¬ 
able,  the  appropriate  operator  is  selected  for  evalua¬ 
tion.  If  no  guidance  arrives  while  the  system  is  waiting, 
a  random  selection  is  made.  All  guidance  in  Soar  takes 
this  form  where  the  advisor  selects  between  competing 
operators.  This  restricts  the  guidance  to  be  from  pre¬ 
defined  alternatives  (the  tied  operators)  and  does  not 
allow  the  input  of  arbitrary  data  structures. 
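
The behaviour of the advise problem space can be approximated by a short function. The names `choose_operator_to_evaluate` and `get_guidance` are hypothetical, and the callback stands in for the print, ask, and wait operators; how long the real system waits is not modelled.

```python
import random


def choose_operator_to_evaluate(tied_operators, get_guidance, rng=random):
    """Pick which tied operator to evaluate first.

    `get_guidance` is a hypothetical callback that is shown the tied
    operators and returns one of them, or None if no advice arrived.
    """
    tied = list(tied_operators)
    advice = get_guidance(tied)
    if advice in tied:
        return advice
    # No guidance (or not one of the predefined alternatives):
    # fall back to a random choice among the tied operators.
    return rng.choice(tied)


tied_ops = ["approach", "rotate-gripper", "move-above"]
first = choose_operator_to_evaluate(tied_ops, lambda ops: "rotate-gripper")
```

Because the advisor can only return one of the listed alternatives, anything else is treated here the same way as silence.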

Once  an  operator  has  been  selected  to  be  evaluated, 
an  impasse  arises  because  there  are  no  productions 
that  can  directly  compute  the  evaluation.  The  de¬ 
fault  response  to  this  impasse  is  to  simulate  the  task 
operator  on  an  internal  copy  of  the  external  environ¬ 
ment.  This  requires  an  internal  model  of  the  precon¬ 
ditions  and  effects  of  the  operators  which  correspond 
to  the  domain  theory  of  an  EBL  system.  The  internal 
search  continues  through  the  recursive  creation  of  tie- 
impasses,  advice,  and  evaluation  until  a  state  is  found 
that  achieves  the  goal. 
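
The recursive look-ahead can be pictured as a depth-limited search over an internal model. The sketch below is an assumption-laden toy: the three operators, the tuple state `(gripper position, held object)`, the depth limit, and the `advise` callback are all invented for illustration, and no chunks are built.

```python
def lookahead(state, goal_test, operators, advise, depth=6):
    """Depth-limited search on an internal model.

    operators: dict name -> (precondition(state) -> bool, effect(state) -> state)
    advise:    hypothetical callback that orders the applicable operator names;
               it only orders the evaluation, the search verifies the result.
    Returns the operator names on a path to the goal, or None.
    """
    if goal_test(state):
        return []
    if depth == 0:
        return None
    applicable = [n for n, (pre, _) in operators.items() if pre(state)]
    for name in advise(state, applicable):
        _, effect = operators[name]
        rest = lookahead(effect(state), goal_test, operators, advise, depth - 1)
        if rest is not None:
            return [name] + rest
    return None


# Toy internal model of picking up one block.
operators = {
    "approach": (lambda s: s == ("above-block", None),
                 lambda s: ("at-block", None)),
    "grasp": (lambda s: s == ("at-block", None),
              lambda s: ("at-block", "block")),
    "withdraw": (lambda s: s[0] == "at-block",
                 lambda s: ("above-block", s[1])),
}
goal = lambda s: s == ("above-block", "block")
plan = lookahead(("above-block", None), goal, operators, lambda s, ops: ops)
```

Poor advice only makes the search longer; the returned path is still checked against the model, which is the sense in which guidance acts as a heuristic.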

After the goal is achieved within the internal search,
preferences  are  created  to  select  those  operators  eval¬ 
uated  on  the  path  to  the  goal.  Each  of  these  prefer¬ 
ences  is  a  result  of  a  tie-impasse,  and  chunks  are  built 
to  summarize  the  processing  that  led  to  their  creation. 
This  processing  includes  all  the  dependencies  between 
the  operator’s  actions  and  the  preconditions  and  ac¬ 
tions  of  the  operators  that  were  applied  after  it  to  fi¬ 
nally  achieve  the  goal,  essentially  the  same  as  the  goal 
regression techniques in EBL [Rosenbloom & Laird,
1986].  Figure  4  is  an  example  of  the  production  that 
is  learned  for  the  approach  operator.  Notice  that  the 
production  not  only  tests  aspects  of  the  current  situ¬ 
ation,  but  also  aspects  of  the  goal.  The  productions 
learned  from  this  search  are  quite  general  and  do  not 
include  any  tests  of  the  exact  positions  or  names  of 
the  blocks  because  these  features  were  not  tested  in 
the subgoal. The internal search is based on relationships
such as left-of or above instead of the exact x,
y,  and  z  locations  of  the  blocks. 

If the approach operator is applicable, and
the gripper is holding nothing
in the safe plane above a block,
and that block must be moved
to achieve the goal,
then create a best preference for
the approach operator.

Figure  4:  Example  production  learned  by  Robo-Soar. 

Following  the  look-ahead  search,  Robo-Soar  has 
learned  which  operator  to  apply  at  each  decision  point. 
Robo-Soar  applies  the  operators  by  generating  motor 
commands  and  moving  the  real  robot  arm.  This  is 
not,  however,  a  blind  application  of  a  plan.  Each  of 
the  learned  productions  will  test  aspects  of  the  envi¬ 
ronment  to  insure  that  an  operator  is  selected  only 
when  appropriate. 

4  Correcting  Control  Knowledge 

A  problem  with  the  analytic  learning  approaches  such 
as  EBL  and  chunking  is  that  the  learning  is  only  as 
good  as  the  underlying  knowledge.  If  there  is  an  er¬ 
ror  in  the  original  domain  theory,  the  learning  will 
preserve  the  error.  External  guidance  is  of  no  help 
when  restricted  to  suggesting  task  operators  because 
the  error  can  be  in  the  underlying  implementation  of 
operators  used  by  the  learning  procedure. 

We  consider  a  simple  case  of  this  problem  by  at¬ 
tempting  the  same  task  as  before  except  with  blocks 
shaped  as  triangular  prisms.  If  the  original  operators 
were  encoded  with  only  cubes  in  mind,  all  of  the  con¬ 
trol  knowledge  and  underlying  simulation  would  be 
insensitive  to  a  feature  in  the  input  that  must  be  at¬ 
tended  to.  To  the  Robo-Soar  vision  system,  the  prisms 
look  just  like  cubes,  except  for  a  line  down  the  middle 
at the apex of the triangle. In order to pick up these
blocks, the gripper must be aligned with the vertical
faces  of  the  block,  not  just  any  two  sides  as  with  a 
cube.  Figure  5  shows  the  operators  Robo-Soar  applies 
for  this  problem.  If  the  gripper  is  not  correctly  aligned, 
the  gripper  will  close  but  not  grasp  the  block.  Upon 
withdrawing  the  gripper,  the  block  will  not  be  picked 
up. 

There  are  many  possible  machine  learning  ap¬ 
proaches  that  could  be  used  to  correct  the  underlying 
knowledge.  First,  the  system  could  have  an  underlying 
“subdomain”  theory  [Doyle,  1986]  of  inclined  planes, 
grippers,  and  friction  that  it  uses  to  understand  why 
the  block  was  not  picked  up.  This  requires  knowing  be¬ 
forehand  that  this  knowledge  will  be  necessary,  and  for 
many  tasks  this  additional  domain  knowledge  may  be 
difficult to obtain. A second approach is to gather examples
of failure and use inductive learning techniques
to  hypothesize  which  feature  in  the  environment  was 
responsible  for  the  problem.  This  may  identify  the 
feature,  but  it  requires  many  failures  and  also  gives  no 
hint  as  to  the  appropriate  action.  A  third  approach 
is  for  the  system  to  experiment  with  its  available  op¬ 
erators  to  see  what  actually  works  [Carbonell  &  Gil, 
1987].  This  approach  can  be  quite  effective,  but  it  also 
can  be  quite  time  consuming  and  possibly  dangerous. 
Our  approach  involves  increasing  the  interaction  be¬ 
tween  the  advisor  and  the  robot  so  that  the  advisor 
can  point  out  relevant  features  in  the  environment  and 
associate  them  with  the  potential  success  or  failure  of 
a  given  operator  or  set  of  operators. 

This  approach  incorporates  outside  guidance  with 
prior  work  in  Soar  on  recovery  from  incorrect  knowl¬ 
edge  [Laird,  1988].  Instead  of  correcting  the  pro¬ 
ductions,  our  recovery  scheme  learns  new  productions 
whose  actions  correct  the  decision  affected  by  the  in¬ 
correct  production.  The  advisor  provides  guidance  in 
re-evaluating the operators being considered for a decision,
leading to new productions that correct the error.
This process of recovery is a domain-independent
strategy  encoded  as  productions. 

The recovery method is invoked when the system
notices  that  an  error  has  been  made.  In  Robo-Soar, 
the  vision  system  detects  that  the  prism  block  is  still 
on  the  table  following  an  attempt  to  pick  it  up.  The 
decision  that  must  be  corrected  is  the  choice  of  the 
approach  operator  when,  in  fact,  the  gripper  should 
be  rotated  so  that  it  is  aligned  with  the  vertical  sides  of 
the  prism  block.  Figure  6  shows  an  abbreviated  trace 
of  the  problem  solving  to  correct  this  decision. 

Once  an  error  has  been  detected,  the  system  tries 
to push forward to the goal, but more deliberately, so
that  the  errant  control  knowledge  does  not  select  the 
wrong  operator.  Previously  learned  control  knowledge 
is  overridden  by  forcing  impasses  for  every  task  op¬ 
erator  decision.  Within  the  context  of  each  of  these 
impasses,  all  of  the  available  operators  can  be  evalu¬ 
ated  and  new  preferences  can  be  created  to  modify  a 
decision  if  it  is  incorrect. 

Figure 5: Trace of operator sequence using recovery.

Figure 6: Trace of problem solving with recovery. The subgoals that allow outside guidance are omitted and
would be used to select operators in both the selection and examine-state problem spaces.

The underlying internal domain knowledge that generates
the evaluation may also be incorrect. Therefore,
the  evaluation  process  is  modified  so  that  outside  guid¬ 
ance  can  be  used  to  evaluate  an  operator  and  associate 
that  evaluation  with  relevant  features  of  the  environ¬ 
ment.  This  modified  evaluation  is  performed  in  the 
examine-state problem space.

The  examine-state  problem  space  is  an  underlying 
theory  for  determining  the  features  of  the  environment 
that  are  relevant  to  the  operator  being  evaluated  and 
relating them to a specific evaluation. There are three
classes  of  operators  in  the  problem  space.  The  first 
class,  called  notice-feature,  explicitly  tests  the  ex¬ 
istence  of  a  feature  of  the  current  task  state  or  goal. 
This operator allows the system to explicitly search
through  the  feature  space  of  the  task.  In  our  example, 
this  allows  the  system  to  test  the  line  down  the  mid¬ 
dle  of  the  prism  block,  which  was  previously  ignored 
by the task productions. For future reference in this
paper we will call this feature block-orientation. The
second class of operators, called compare-features,
can  perform  simple  comparisons  between  noticed  fea¬ 
tures,  such  as  detecting  that  two  features  have  the 
same  value.  The  third  class  of  operators  creates  an 
evaluation,  such  as  success  or  failure,  for  the  task  op¬ 
erator  being  evaluated.  Together,  these  three  classes  of 
operators  provide  a  complete  domain  theory  for  com¬ 
puting  evaluations  and  relating  these  evaluations  to 
relevant  features  of  the  environment.  Although  com¬ 
plete,  it  is  underconstrained  because  any  evaluation 
can  be  paired  with  any  set  of  features. 
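
The three operator classes can be mimicked in a few lines. Everything below is a hypothetical rendering rather than the Robo-Soar productions: the `ExamineState` class, its method names, and the dictionary task state are assumptions. The sketch also makes the underconstrained nature visible, since `evaluate` accepts any verdict for any noticed features.

```python
class ExamineState:
    """Toy examine-state episode: noticed features become rule conditions."""

    def __init__(self, task_state, task_operator):
        self.task_state = task_state
        self.task_operator = task_operator
        self.noticed = []       # feature names, in the order noticed
        self.comparisons = []   # (feature_a, feature_b, values_equal)

    def notice_feature(self, name):
        if name not in self.task_state:
            raise KeyError(name)
        self.noticed.append(name)

    def compare_features(self, a, b):
        if a not in self.noticed or b not in self.noticed:
            raise ValueError("only noticed features can be compared")
        equal = self.task_state[a] == self.task_state[b]
        self.comparisons.append((a, b, equal))

    def evaluate(self, verdict):
        """Return a rule: (state, candidate) -> verdict, or None if no match."""
        operator, comparisons = self.task_operator, list(self.comparisons)

        def rule(state, candidate):
            if candidate != operator:
                return None
            for a, b, equal in comparisons:
                if a not in state or b not in state:
                    return None
                if (state[a] == state[b]) != equal:
                    return None
            return verdict
        return rule


state = {"gripper-orientation": 0, "block-orientation": 90, "block-name": "prism-1"}
episode = ExamineState(state, "approach")
episode.notice_feature("gripper-orientation")
episode.notice_feature("block-orientation")
episode.compare_features("gripper-orientation", "block-orientation")
approach_fails_if_misaligned = episode.evaluate("failure")
```

The learned rule tests only the relation between the two orientations, so it fires for any misaligned pair of values and ignores the block's name.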

External guidance can lead Robo-Soar to select the
operators  that  notice  only  those  features  relevant  to 
the  current  task  state  and  associate  them  with  the 
appropriate  evaluation.  If  an  operator  is  deemed  to 
be on the path to success, a preference will be created
to prefer it over an operator that leads to failure.
In our example, the advisor first points out that
the  approach  operator  will  fail  when  the  gripper  is 
above a block where the orientation of the gripper
is  not  aligned  with  the  block-orientation.  The  advisor 
then suggests evaluating the rotate-gripper operator,
and points out the relevant features that make
its  selection  desirable.  Figure  7  shows  the  produc¬ 
tions  that  are  learned  to  avoid  the  approach  operator 
and select the rotate-gripper operator whenever the
gripper  is  not  aligned. 

If the approach operator is applicable, and
the gripper is above a block, and
the gripper’s orientation is different
from a line in the middle of the block,
then create a preference to reject approach.

If the rotate operator is available, and
the gripper is above a block, and
the gripper’s orientation is different
from a line in the middle of the block,
then create a best preference for rotate-gripper.

Figure  7:  Example  correction  productions  learned  by 
Robo-Soar. 
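
To illustrate how such correction productions change a decision without editing the earlier chunk, the sketch below encodes simplified versions of the Figure 4 and Figure 7 rules as Python generators. The state keys, the `select` function, and the assumption that a reject preference overrides a best preference are choices made for this example only.

```python
def old_chunk(state):
    # Learned before the prisms appeared (Figure 4, simplified).
    if state["gripper-holding"] is None and state["gripper-above-block"]:
        yield ("best", "approach")


def misaligned(state):
    return (state["gripper-above-block"]
            and state["gripper-orientation"] != state["block-orientation"])


def reject_approach(state):
    if misaligned(state):
        yield ("reject", "approach")


def prefer_rotate(state):
    if misaligned(state):
        yield ("best", "rotate-gripper")


def select(state, productions, acceptable):
    """Return the chosen operator, or None to signal an impasse."""
    prefs = [p for production in productions for p in production(state)]
    rejected = {op for kind, op in prefs if kind == "reject"}
    best = [op for kind, op in prefs if kind == "best" and op not in rejected]
    candidates = best or [op for op in acceptable if op not in rejected]
    return candidates[0] if len(candidates) == 1 else None


ops = ["approach", "rotate-gripper"]
prism = {"gripper-holding": None, "gripper-above-block": True,
         "gripper-orientation": 0, "block-orientation": 90}
aligned = dict(prism, **{"gripper-orientation": 90})
corrected = [old_chunk, reject_approach, prefer_rotate]
```

With only `old_chunk`, the misaligned `prism` state still selects approach; with the two new rules added, the same state selects rotate-gripper, while the `aligned` state continues to select approach.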

Once  these  evaluations  are  made,  the  recovery 
knowledge  detects  that  the  previously  preferred  op¬ 
erator  is  now  rejected,  and  therefore  assumes  that  the 
error  has  been  corrected.  The  error  signal  is  removed 
so that future decisions are made without forced impasses.
From this point, the chunks apply and take
Robo-Soar to the solution, as shown in Figure 5. In
future  situations,  Robo-Soar  correctly  aligns  the  grip¬ 
per  before  approaching  a  prism.  If  errors  still  exist, 
the  advisor  can  signal  this  by  merely  typing  error  to 
Robo-Soar.  The  advisor  can  also  signal  that  an  er¬ 
ror  has  been  fixed  if  Robo-Soar  is  unable  to  detect  it 
automatically. 

5  Correcting  Domain  Knowledge 

Although  the  method  described  in  the  previous  sec¬ 
tion  corrects  control  knowledge,  it  does  not  correct 
the  underlying  domain  knowledge;  specifically,  it  does 
not add the precondition for the approach operator
that  the  gripper  be  aligned  with  the  block.  Although 
the  new  control  knowledge  prevents  the  operator  from 
being  applied,  the  missing  operator  preconditions  will 
lead  to  errors  in  learning  when  the  operator  is  cor¬ 
rectly  applied.  Learning  will  be  incorrect  because  the 
missing preconditions will not be incorporated in future
chunks that depend upon the application of the
approach  operator. 

Instead  of  attempting  to  modify  the  productions 
that  propose  and  implement  the  approach  operator, 
Robo-Soar  creates  a  new  approach  operator  that  re¬ 
places  the  original.  The  new  operator  will  have  the 
appropriate  preconditions  and  will  always  be  preferred 
to the original operator; the original is essentially forgotten.
The general approach is to create a new operator,
notice additional preconditions in the task state,
then  learn  the  implementation  of  the  new  operator  us¬ 
ing  the  original  approach  operator.  Throughout  this 
discussion,  all  guidance  is  through  the  advise  problem 
space  as  described  earlier. 

To  control  the  creation  of  the  new  operator,  the  se¬ 
lection  problem  space  is  augmented  so  that  when  there 
is  an  error,  one  alternative  is  to  evaluate  a  completely 
new  operator.  If  the  advisor  selects  this  alternative, 
a  new  operator  is  created  and  evaluated  to  be  better 
than  the  original,  thus  replacing  it.  When  the  decision 
is  made  to  evaluate  the  new  operator,  the  examine- 
state  problem  space  is  used  (along  with  outside  guid¬ 
ance)  and  appropriate  control  productions  are  learned 
to  propose  and  select  this  new  operator. 

At  this  point,  the  system  does  not  have  the  knowl¬ 
edge  to  apply  this  operator;  therefore,  when  the  new 
operator  is  selected,  another  impasse  arises.  In  re¬ 
sponse  to  this  impasse,  the  examine-state  problem 
space  is  again  selected,  but  it  has  been  augmented 
with  an  additional  operator  that  can  apply  a  task  op¬ 
erator.  To  build  the  correct  definition  of  the  new  oper¬ 
ator,  notice-feature  operators  are  selected  through 
outside  guidance  to  incorporate  the  missing  precon¬ 
ditions,  which  for  our  example  are  the  orientation  of 
the  gripper  and  the  block-orientation.  Following  the 
determination  of  the  appropriate  features,  additional 
guidance  can  specify  a  task  operator  that  should  be 
used  to  implement  the  new  operator  in  the  subgoal.  In 
this  case,  it  would  be  the  original  approach  operator. 
This  operator  is  applied  to  the  task  state  within  the 
subgoal  and  the  changes  it  makes  to  the  state  are  re¬ 
sults  of  the  subgoal.  These  results  lead  to  the  creation 
of  chunks  that  test  for  the  new  operator  and  its  pre¬ 
conditions,  and  then  apply  the  operator  to  the  state. 
The  new,  corrected  operator  replaces  the  old  operator, 
thus  correcting  the  underlying  domain  theory. 
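
The effect of this procedure can be summarized as wrapping the original operator. The following sketch assumes operators are plain dictionaries with `precondition` and `effect` entries and that the missing precondition is an equality between two noticed features; both are simplifications, and the names are hypothetical.

```python
def approach_effect(state):
    new_state = dict(state)
    new_state["gripper-position"] = "at-block"
    return new_state


original_approach = {
    "name": "approach",
    "precondition": lambda s: (s["gripper-position"] == "above-block"
                               and s["holding"] is None),
    "effect": approach_effect,
}


def make_corrected_operator(original, noticed_features, name):
    """New operator: original preconditions plus an equality test on the
    noticed features; the effect reuses the original implementation."""
    a, b = noticed_features

    def precondition(state):
        return original["precondition"](state) and state[a] == state[b]

    return {"name": name, "precondition": precondition,
            "effect": original["effect"]}


new_approach = make_corrected_operator(
    original_approach, ("gripper-orientation", "block-orientation"), "approach-2")

misaligned_state = {"gripper-position": "above-block", "holding": None,
                    "gripper-orientation": 0, "block-orientation": 90}
aligned_state = dict(misaligned_state, **{"gripper-orientation": 90})
```

The original operator is left untouched; the new one simply refuses to apply when the gripper and block orientations differ.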

6  Extending  Domain  Knowledge 

The  method  we  have  described  for  creating  a  new  op¬ 
erator  using  an  existing  operator  definition  can  be  ex¬ 
tended  so  that  a  new  operator  can  be  learned  from 
scratch  through  guidance.  This  is  useful  for  building  a 
new  domain  theory,  as  well  as  completing  an  existing 
domain  theory.  In  Robo-Soar,  we  consider  the  situa¬ 
tion  in  which  the  original  programmer  decided  that  it 
was  not  necessary  to  include  a  rotate-gripper  oper¬ 
ator. 

We  extend  the  previous  approach  by  adding  domain- 
independent  knowledge  that  can  generate  the  individ¬ 
ual  actions  of  an  operator.  This  knowledge  is  encoded 
as  additional  operators  in  the  examine-state  problem 
space.  These  operators  modify  or  remove  existing 
structures  on  the  state  or  new  task  operator,  create 
new  intermediate  structures,  or  terminate  the  new  op¬ 
erator.  Once  the  new  operator  terminates,  chunk¬ 
ing  creates  productions  that  implement  it  without  the 
need for further impasses or guidance.

To  teach  the  system  an  internal  simulation  of 
rotate-gripper,  the  orientation  of  the  gripper  and 
the block are noticed, and then the gripper orientation
is modified to be the orientation of the block. The
exact  values  (and  representations)  of  the  orientations 
of  the  block  and  the  gripper  are  irrelevant.  All  that 
is needed is to copy a pointer to the orientation of the
block onto the data structure representing the orientation
of the gripper. From a single example the
system  learns  a  general  production  for  implementing 
rotate-gripper.
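
Why one example suffices can be seen in a small sketch: the learned rule never inspects the orientation value, it only copies it. The function `learn_copy_rule` and the dictionary state are hypothetical stand-ins for the chunk and for working memory.

```python
def learn_copy_rule(noticed_source, modified_target):
    """Generalize one guided episode into 'copy source value onto target'."""
    def rule(state):
        new_state = dict(state)
        # The value is copied by reference and never examined.
        new_state[modified_target] = state[noticed_source]
        return new_state
    return rule


rotate_gripper = learn_copy_rule("block-orientation", "gripper-orientation")

after = rotate_gripper({"gripper-orientation": 0, "block-orientation": 90})
opaque = object()  # any representation of an orientation works
after_opaque = rotate_gripper({"gripper-orientation": "north",
                               "block-orientation": opaque})
```

The same rule handles a numeric angle and an arbitrary opaque object, which mirrors the remark that the exact values and representations are irrelevant.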

These  extensions  are  sufficient  for  creating  operators 
that  modify  or  remove  existing  structures  or  create 
new  intermediate  structures.  They  are  insufficient  for 
implementing  operators  that  must  create  new  struc¬ 
tures  with  specific  symbols  that  do  not  pre-exist  in 
working  memory.  The  problem  is  that  there  is  no  way 
to generate these symbols so that they can be selected
by  an  advisor.  This  is  the  one  case  where  guidance  by 
selecting  from  a  fixed  set  of  alternatives  breaks  down. 
Fortunately, the only time that specific symbols are
necessary  is  when  issuing  commands  to  the  motor  sys¬ 
tem.  Therefore,  we  have  included  an  operator  that 
can  generate  all  of  the  robot  command  symbols,  such 
as  move,  open,  and  rotate.  This  is  the  only  domain- 
dependent  knowledge  that  must  be  pre-encoded  in  the 
system.

All  of  these  extensions  expand  the  examine-state 
problem space so that it has sufficient symbol manipulation
capabilities for creating and implementing
task operators through the composition of primitive
domain-independent operators. However, outside
guidance  is  necessary  to  control  the  composition  of 
features  and  actions  so  that  only  legal  operator  im¬ 
plementations  are  generated. 

In a previous version of Soar, the system was taught
to play Tic-Tac-Toe from scratch. The system initially
had no notion of two-player games, three-in-a-row,
winning, or losing. The system did have an initial
representation  of  the  board,  the  symbols  X  and  O,  and 
the  command  to  make  a  move.  Through  outside  guid¬ 
ance,  operators  were  created  to  pick  the  side  to  move 
next,  make  a  move  of  the  chosen  side,  wait  for  the  op¬ 
ponent  to  make  a  move,  and  detect  winning  and  losing 
positions. 

7  Discussion 

There are two major contributions of this work. First,
we have demonstrated that it is possible to extend analytic
learning systems so that not only can they learn
control knowledge using an existing domain theory,
but through outside guidance they can also be used
to correct and create new domain knowledge. The
examine-state problem space is somewhat of a brute-force
technique to learn new features and operators.
It currently requires an outside agent to lead the system
through a search of potentially relevant features
and actions. Although it may not be considered the
most  elegant  or  complex  machine  learning  technique, 
it  allows  the  advisor  to  easily  correct  and  extend  the 
system. 

This same approach could be used without an outside
agent by having the system engage in experimentation.
To experiment, the system can guess at relevant
features and associated actions. It may pay attention
to irrelevant features, and thus create overgeneral
or  incorrect  chunks.  But  after  many  interactions  with 
an  environment  where  there  is  sufficient  feedback,  it 
could  gradually  learn  those  correct  associations.  Many 
powerful  heuristics  are  available  to  avoid  a  blind  search 
through  this  hypothesis  space,  such  as  concentrating 
on  new,  unknown  features,  as  well  as  those  features 
that  are  modified  by  the  operators  under  considera¬ 
tion.  For  example,  if  the  system  has  discovered  that 
the  rotate  operator  is  necessary,  it  could  concentrate 
its  search  for  features  relevant  to  avoiding  approach  to 
those  features  modified  by  rotate-gripper.  This  ap¬ 
proach  could  also  support  hybrid  methods  that  involve 
both  outside  guidance  and  experimentation,  where  the 
system  experiments  when  on  its  own  but  uses  guidance 
when  it  is  available. 

The  second  major  contribution  is  to  demonstrate  the 
generality  of  guidance  in  knowledge  acquisition.  The 
form  of  guidance  we  allow  is  very  restricted  in  that  the 
advisor  must  pick  from  a  set  of  available  options.  One 
advantage  of  this  scheme  is  that  the  advice  is  given 
within  the  context  of  a  specific  problem  and  advice 
is  asked  only  for  those  decisions  for  which  the  system 
has  incomplete  knowledge.  A  second  advantage  is  that 
the  advisor  does  not  have  to  make  explicit  the  reasons 
for  the  selection  of  an  operator  for  which  the  system 
has  a  correct  internal  model;  the  learning  mechanism 
performs  the  necessary  analysis.  A  third  advantage  is 
that  the  guidance  and  learning  occur  while  the  system 
is  running.  There  is  no  need  to  ever  turn  off  the  per¬ 
formance  system  to  update  or  correct  its  knowledge 
base. Finally, by integrating the guidance, the problem
solving and learning within a single architecture
such  as  Soar,  the  guidance  can  be  used  to  correct  or 
extend  any  of  the  long  term  knowledge  of  the  system. 

The  weakness  of  this  approach  is  that  the  advisor 
sometimes  must  specify  individual  preconditions  and 
effects  of  an  operator.  This  can  be  quite  tedious  and 
it  requires  the  human  to  identify  which  preconditions 
or effects are missing when correcting a domain
theory.  These  problems  would  be  greatly  reduced  if 
our interface were improved so that the advisor could
more directly observe the structure of the current state
and  operator.  A  more  long-term  solution  is  to  provide 
the  system  with  additional  knowledge  that  allows  it  to 
perform  more  of  the  diagnosis  and  correction  by  itself. 

The  goal  of  our  research  was  to  demonstrate  the 
practicality and generality of learning using only analytic
techniques combined with outside guidance. Although
the demonstrations were performed within the
Soar architecture, the results should extend to similar
systems such as Prodigy [Minton et al., 1989] and
Theo [Mitchell et al., 1990] that combine problem solving
and EBL. These systems may require some architectural
extensions, for example, adding the ability to
cast  operator  creation,  selection,  and  implementation 
as  subproblems. 

The  actual  task  performed  by  Robo-Soar  was  quite 
simple,  and  did  not  address  many  of  the  complexi¬ 
ties  of  interacting  with  external  environments,  such  as 
dealing  with  sensor  and  control  errors.  Our  current 
goal  is  to  extend  Robo-Soar  to  more  complex  tasks 
and  expand  the  spectrum  of  human  interaction.  At 
one  end,  we  plan  to  investigate  refining  the  guidance 
so  that  it  is  easier  to  correct  and  extend  the  domain 
theory,  approaching  the  goals  of  the  Instructable  Pro¬ 
duction  System  where  a  system  is  never  programmed, 
only  given  external  guidance  [Rychener,  1983].  On  the 
other  end  of  the  spectrum,  we  plan  to  study  experi¬ 
mentation  techniques  so  that  Robo-Soar  will  be  able 
to  learn  much  of  the  same  information  on  its  own, 
when  human  guidance  is  unavailable. 

Abstract

To  make  efficient  use  of  a  dynamic  sys¬ 
tem  such  as  a  mechanical  manipulator,  the 
robotic  controller  needs  various  models  of  its 
behaviour.  I  describe  a  method  of  learning  in 
which  all  the  experiences  in  the  lifetime  of  the 
robot  are  explicitly  remembered.  They  are 
stored  in  a  manner  which  permits  fast  recall 
of  the  closest  previous  experience  to  any  new 
situation.  This  leads  to  a  very  high  rate  of 
learning  of  the  robot  kinematics  and  dynam¬ 
ics  which  conventionally  need  to  be  derived 
analytically.  The  representation  is  a  mod¬ 
ified  binary  multidimensional  tree  called  a 
SAB-tree which stores state-action-behaviour
triples.  This  permits  fast  prediction  of  the 
effects  of  proposed  actions  and,  given  a  goal 
behaviour,  permits  fast  generation  of  a  can¬ 
didate  action.  I  also  discuss  how  the  system 
is  made  resistant  to  noisy  inputs  and  adapts 
to  environmental  changes.  I  explain  how  ap¬ 
propriate  actions  can  be  selected  in  the  cases 
where  (i)  there  has  been  earlier  success  and 
(ii)  experimentation  is  required.  This  can 
be  used  to  transform  dynamic  control  to  a 
greatly  simplified  problem.  I  conclude  with 
some  simulated  experiments  which  exhibit 
high  rates  of  learning.  The  final  experiment 
also  illustrates  how  a  compound  learning  task 
can  be  structured  into  a  hierarchy  of  simple 
learning  tasks. 

1  Introduction 

To  make  efficient  use  of  a  dynamic  system  such  as  a 
mechanical  manipulator,  the  robotic  controller  needs 
models  of  various  aspects  of  its  behaviour.  Derivation 
of  these  models  is  an  issue  of  central  importance  in 
robotics.  They  can  either  be  hardwired  into  the  con¬ 
troller  upon  its  creation,  or  else  the  controller  can  at¬ 
tempt  to  learn  them  itself,  using  observations  from  its 
sensors. The adaptability of the latter method is appealing.
Despite this, conventional control has almost
invariably  used  the  former  method. 

It is desirable for a robotic learning system to learn
efficiently. Firstly, the information processing needs to
take  place  in  real  time.  Observations  about  the  world 
need  to  be  recorded  and  decisions  need  to  be  made 
within  the  timescale  of  the  robot’s  dynamics.  Fur¬ 
thermore,  if  learning  is  computationally  cheap  then  it 
need  never  be  switched  off.  Thus  the  robot  can  adapt 
to  change  throughout  its  lifetime.  Secondly,  the  rate 
of  learning  should  be  high.  This  will  reduce  the  expen¬ 
sive  (and  possibly  hazardous)  period  of  time  taken  up 
for  training.  Furthermore,  a  robot  with  a  high  rate  of 
learning  can  adapt  quickly  to  changes  in  its  environ¬ 
ment.  The  rate  of  learning  depends  particularly  on  the 
quality  of  the  system’s  generalization  abilities.  For  a 
robotic  system  which  learns  by  monitoring  itself  there 
is  a  second  factor  crucial  to  the  rate  of  learning:  how 
and  when  should  experimental  actions  be  applied? 

In  this  paper  I  discuss  an  investigation  into  a  practi¬ 
cal  learning  approach  for  robotic  systems  which  simply 
uses  as  its  performance  element,  the  explicit  set  of  all 
the  state  information  it  has  received  through  its  sen¬ 
sors. 
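
As a hedged illustration of recalling the closest previous experience, the sketch below stores experiences in a plain k-d tree and retrieves the nearest stored key. It is a stand-in only: the modifications that distinguish the tree described in the abstract are not reproduced, and the class names, the layout of the key as a (state, action) vector, and the numbers are all made up for the example.

```python
import math


class Node:
    def __init__(self, key, value, axis):
        self.key, self.value, self.axis = key, value, axis
        self.left = self.right = None


class KDTree:
    """Incrementally built k-d tree keyed on (state, action) vectors."""

    def __init__(self):
        self.root = None

    def insert(self, key, value):
        if self.root is None:
            self.root = Node(key, value, 0)
            return
        node = self.root
        while True:
            side = "left" if key[node.axis] < node.key[node.axis] else "right"
            child = getattr(node, side)
            if child is None:
                next_axis = (node.axis + 1) % len(key)
                setattr(node, side, Node(key, value, next_axis))
                return
            node = child

    def nearest(self, query):
        """Return (key, value) of the stored point closest to `query`.

        Assumes the tree is not empty.
        """
        best = [None, math.inf]  # [node, distance]

        def visit(node):
            if node is None:
                return
            d = math.dist(query, node.key)
            if d < best[1]:
                best[0], best[1] = node, d
            diff = query[node.axis] - node.key[node.axis]
            near, far = ((node.left, node.right) if diff < 0
                         else (node.right, node.left))
            visit(near)
            if abs(diff) < best[1]:  # other half-space may hold a closer point
                visit(far)

        visit(self.root)
        return best[0].key, best[0].value


memory = KDTree()
# Hypothetical experiences: key = (position, velocity, torque),
# value = the behaviour that was observed afterwards.
memory.insert((0.0, 0.0, 1.0), 0.8)
memory.insert((0.5, 0.1, 1.0), 0.6)
memory.insert((0.5, 0.1, -1.0), -1.1)
closest_key, predicted = memory.nearest((0.45, 0.1, 0.9))
```

Predicting the effect of a proposed action then amounts to looking up the nearest stored (state, action) key and reading off its recorded behaviour.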

1.1  Manipulator  Modelling 

The  principal  example  in  this  paper  will  be  a  multi- 
jointed  manipulator.  Its  controller  is  able  to  observe 
the  end-point  of  the  arm  by  means  of  a  number  of 
cameras  connected  to  a  framestore — the  robot’s  retina. 

The  conventional  method  first  transforms  this  loca¬ 
tion  according  to  a  perspective  model  into  real  world 
coordinates  then  secondly,  via  a  kinematic  model,  into 
joint  space  coordinates.  Thirdly,  a  dynamic  model  of 
the  arm  evaluates  the  effect  which  gravity  and  joint 
torques  will  have  on  the  joint  angles  and  angular  ve¬ 
locities  of  the  arm.  The  fourth  component,  the  con¬ 
trol  model,  uses  errors  in  the  perceived  location  of  the 
arm’s  end-point  to  decide  which  torques  need  be  ap¬ 
plied  to  lessen  the  error  in  future. 

This approach has been traditionally applied with a
fair degree of success [Fu et al., 1987] but has several
drawbacks. As well as the expense and complexity,
the major problem is adaptability and the dependence
upon accurate values of system parameters. Some
work has attempted to eliminate the prespecified models
entirely by instead making the system learn them.
For example, in [Barto et al., 1983] the control and dynamics
components were learned, while [Clocksin and
Moore, 1989] investigates the learning of the perspective
and kinematic models for a five joint manipulator.
In [Mel, 1989] the kinematics and the relationship between
the differential kinematics and changes to the
explicit retinal image were learned.

2  Proximal  Learning 

In  this  work,  I  investigate  a  system  which  learns 
the  perspective,  kinematic  and  dynamics  relations  in 
one compound mapping, the proximal state transition
function  (PSTF): 


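$$(p \times \dot{p}) \times t \rightarrow \ddot{p} \qquad (1)$$

where $p$ is the proximal coordinates of the arm on
the robot's retina (with $\dot{p}$ and $\ddot{p}$ its proximal
velocity and acceleration), and $t$ is the vector of torques
applied. As I will explain later, a further result of using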
this  mapping  is  that  the  control  model  can  become 
extremely  simple. 

2.1  Performance  Element 

The  performance  element  is  a  structure  which  adapts 
according  to  a  set  of  exemplars — observations  of  the 
real  world.  Each  exemplar  consists  of  three  vectors  of 
real numbers, called s, a and b. s is an element of
the proximal state space: the proximal position and
proximal velocity. a is the control action, a vector
of joint torques to be supplied. b is the observed
behaviour, the proximal acceleration. These exemplars
are  glimpses  of  the  underlying  mapping  which  we  wish 
to  learn: 


$$f : S \times A \rightarrow B \qquad (2)$$

The  two  useful  tasks  required  of  the  PSTF  are: 


1. Prediction: Given a value (s, a) ∈ S × A, what
is f(s, a)?

2. Partial Inversion: Given a value s ∈ S and a
target value b ∈ B, what value of a ∈ A (if any)
will give f(s, a) = b?


This work uses, as its performance element, the explicit
set of all observed exemplars. We call this set
E. Prediction is based on the nearest neighbour by
the assumption of smoothness of the relation. To predict
f(s, a), find the (s_i, a_i, b_i) ∈ E for which (s_i, a_i)
is closest to (s, a). The predicted behaviour is b_i. The
notion of "closeness" is provided by the Euclidean metric,
with the components of s_i, a_i and b_i all scaled uniformly
from their maximum ranges to the range [0, 1].

The ranges of both sensors and actuators
are generally explicitly known to the system designer.
If not, they can be discovered.

Given s and a desired b, partial inversion could be
accomplished by finding a (s_i, a_i, b_i) ∈ E such that s =
s_i and b = b_i. However, in general such an exemplar
will not be available. Instead, the exemplar with the
nearest (s_i, b_i) to (s, b) can be expected to have a good
a_i, provided that the predicted value of f(s, a_i) is b_i.
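
The two queries above can be sketched with a plain linear scan. This is only an illustration, not the data structure used in this work: the function names are hypothetical, exemplars are assumed to be tuples (s, a, b) of already-scaled coordinates, and the brute-force search stands in for the tree described below.

```python
import math

# Each exemplar is (s, a, b): tuples of floats assumed already scaled to [0, 1].

def nearest(exemplars, query, key):
    """Exemplar whose key(e) is closest to query in the Euclidean metric."""
    return min(exemplars, key=lambda e: math.dist(key(e), query))

def predict(exemplars, s, a):
    """Predicted behaviour: the b of the exemplar nearest in (s, a)."""
    return nearest(exemplars, s + a, lambda e: e[0] + e[1])[2]

def partial_invert(exemplars, s, b):
    """Candidate action: the a of the exemplar nearest in (s, b).

    Returns None when predicting with that action from state s does not
    give back the same exemplar's b (the proviso stated in the text).
    """
    s_i, a_i, b_i = nearest(exemplars, s + b, lambda e: e[0] + e[2])
    return a_i if predict(exemplars, s, a_i) == b_i else None

# Two hypothetical one-dimensional exemplars.
E = [((0.1,), (0.2,), (0.3,)),
     ((0.8,), (0.9,), (0.5,))]
```

A linear scan costs O(n) per query, which is what motivates the tree representation discussed next.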

The nearest neighbour generalization has several advantages.
The most important is the one-shot learning
behaviour. A second advantage is that the generalization
adapts to different resolutions of interest, and in
particular there is no need to quantize the data. These
issues are expanded upon in [Clocksin and Moore,
1989]. The error between the nearest neighbour value
and the correct value is expected to be inversely proportional
to the local density of exemplars.

The representation of E should, as well as permitting
quick insertion of exemplars, provide fast searches
of nearest neighbour in the (s, a)-components and also
in the (s, b)-components. This can be achieved using a
sab-tree. This is a form of a binary multidimensional
data structure called a kd-tree [Preparata and Shamos,
1985; Omohundro, 1987], modified to permit search on
both the (s, a) and (s, b) components as well as efficient
smoothing and adaption operations described below.
In theory, the number of exemplars examined during a
nearest neighbour search tends towards log n as n, the
number of exemplars, gets large. However, the magnitude
of n for which this behaviour is reached depends
critically on the distribution of the exemplars. Despite
this, trials using real data have established that
the proportion of exemplars searched is sufficiently low
to make nearest neighbour searching in real time entirely
practical. Figure 1 shows the average number of
nodes inspected against tree size for a 6d-tree of robot
dynamics exemplars. The search for the correct point
in a sab-tree to insert a new exemplar is also O(log n).
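
For readers unfamiliar with kd-trees, the following is a minimal sketch of insertion and nearest neighbour search in a textbook kd-tree. It is not the sab-tree itself: it indexes a single key only, omits the smoothing and deletion operations, and all names (`Node`, `insert`, `nearest`) are illustrative assumptions.

```python
import math

class Node:
    def __init__(self, point, payload, axis):
        self.point, self.payload, self.axis = point, payload, axis
        self.left = self.right = None

def insert(root, point, payload, depth=0):
    """Insert a point and return the (possibly new) subtree root."""
    if root is None:
        return Node(point, payload, depth % len(point))
    side = "left" if point[root.axis] < root.point[root.axis] else "right"
    setattr(root, side, insert(getattr(root, side), point, payload, depth + 1))
    return root

def nearest(root, query, best=None):
    """Return (distance, node) for the stored point closest to query."""
    if root is None:
        return best
    d = math.dist(root.point, query)
    if best is None or d < best[0]:
        best = (d, root)
    diff = query[root.axis] - root.point[root.axis]
    near, far = (root.left, root.right) if diff < 0 else (root.right, root.left)
    best = nearest(near, query, best)
    if abs(diff) < best[0]:   # the far half-space could still hold a closer point
        best = nearest(far, query, best)
    return best
```

The pruning test in `nearest` is what keeps the number of inspected nodes small when the exemplars are well distributed; in the worst case the whole tree is still visited.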


2.2  Filtering  Noise 

In practice, the exemplars are of the form
(s_i, a_i, f(s_i, a_i) + ε(i)) where ε(i) can be treated as
some unspecified noise distribution. The noise can be
reduced by taking the weighted mean of several local
exemplars. The predicted value at (s, a) is estimated
as




$$\mathrm{eval}(s, a) = \frac{\sum_i \mathrm{weight}(s_i, a_i) \times b_i}{\sum_i \mathrm{weight}(s_i, a_i)} \qquad (3)$$


where weight(s_i, a_i) is a function of the distance of
(s_i, a_i) from (s, a) which decreases to zero beyond a
certain distance r from (s, a). The only exemplars
to make a non-zero contribution are those which lie
within this distance and so only a local range of the
tree needs to be inspected. The value of r is a property
of the individual sab-tree and can be adjusted according
to how much smoothing is required.

To avoid the expense of a range search for each sab-tree
interrogation, and also to permit partial inversion
to make use of the smoothed values, the smoothed
values are stored with each exemplar. A node of the
sab-tree is thus a quadruplet: (s_i, a_i, b_i, b_i^smooth) where
b_i^smooth = eval(s_i, a_i). Prediction of the smoothed
value of an unknown (s, a) is the b^smooth component
of the nearest (s_i, a_i, b_i, b_i^smooth).
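
A small sketch of the weighted mean of Equation 3 follows. The text does not specify the weight function, so the triangular kernel used here is an assumption, as are the function names; a linear scan replaces the local range search of the tree.

```python
import math

def weight(d, r):
    """Assumed kernel: falls linearly from 1 at distance 0 to 0 at distance r."""
    return max(0.0, 1.0 - d / r)

def eval_smoothed(exemplars, s, a, r):
    """Weighted mean of b over exemplars within distance r of (s, a)."""
    num, den = None, 0.0
    for s_i, a_i, b_i in exemplars:
        w = weight(math.dist(s_i + a_i, s + a), r)
        if w > 0.0:
            den += w
            scaled = [w * x for x in b_i]
            num = scaled if num is None else [n + x for n, x in zip(num, scaled)]
    # None signals that no exemplar lies within the smoothing radius.
    return None if den == 0.0 else tuple(n / den for n in num)

# Hypothetical exemplars (s, a, b); the third is too far away to contribute.
E = [((0.0,), (0.0,), (1.0,)),
     ((0.2,), (0.0,), (3.0,)),
     ((0.9,), (0.9,), (100.0,))]
smoothed = eval_smoothed(E, (0.1,), (0.0,), r=0.5)
```

Here the two nearby exemplars are equidistant from the query, so the estimate is their plain average.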


2.3  Adapting  to  Change 

The  behaviour  of  a  robotic  system  is  really  of  the  fol¬ 
lowing  form: 


$$f_{\mathrm{actual}} : S \times A \times \mathrm{time} \rightarrow B \qquad (4)$$

This varies very slowly but unpredictably with time.
As time progresses, some of the exemplars will represent
inaccurate values. Fortunately, such exemplars
can be detected: they are (i) relatively old and (ii) have
a large discrepancy between b_i and b_i^smooth. These
exemplars are deleted from E. The smoothed values of
local exemplars are updated to take account of this
deletion. The age of an exemplar is recorded by a
"date of birth" field. Examples of noise filtering and
adapting to change for a one dimensional mapping can
be seen in Figure 2.

It is important that both smoothing and removing
inadequate exemplars are computationally cheap, so
that learning and adaption will occur during the execution
of a task. Smoothing occurs each time a point
is added to the tree. Obtaining those local points with
a non-zero contribution to b^smooth costs O(S + log n)
where S is the number of these local points. Each local
point has its own b^smooth component updated to
take account of the new point, and it is then that any
elderly, inaccurate points are removed.
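
The deletion rule can be sketched as a simple filter. The thresholds `max_age` and `max_gap`, the field names, and the use of a list rather than a tree are all assumptions made for illustration; the sketch also omits the re-smoothing of neighbours after a deletion.

```python
import math
from dataclasses import dataclass

@dataclass
class Exemplar:
    s: tuple
    a: tuple
    b: tuple          # observed behaviour
    b_smooth: tuple   # stored smoothed behaviour
    born: float       # "date of birth" timestamp

def is_stale(e, now, max_age, max_gap):
    """Old AND far from its own smoothed value: probably out of date."""
    return (now - e.born) > max_age and math.dist(e.b, e.b_smooth) > max_gap

def prune(exemplars, now, max_age, max_gap):
    """Return only the exemplars worth keeping."""
    return [e for e in exemplars if not is_stale(e, now, max_age, max_gap)]

# Hypothetical exemplars: only the first is both old and discrepant.
E = [Exemplar((0.1,), (0.1,), (0.9,), (0.2,), born=0.0),
     Exemplar((0.1,), (0.1,), (0.9,), (0.2,), born=95.0),
     Exemplar((0.5,), (0.5,), (0.4,), (0.41,), born=0.0)]
kept = prune(E, now=100.0, max_age=50.0, max_gap=0.3)
```

Requiring both conditions means a recent outlier is retained (it may be the first sign of a change), while an old exemplar that still agrees with its neighbours is also retained.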

3  Using  sab-trees  for  Control  Choice

In this section I consider how to choose an action given
(i) the current state s_q and (ii) the desired behaviour
b_q. There are two opposing aims in making a control
choice. These are

1.  Perform.  We  wish  to  perform  as  well  as  possible 
given  the  knowledge  contained  in  the  performance 
element. 

2. Experiment. In order to perform better in future,
it is worth trying actions with no guaranteed
success, but with the chance of a valuable discovery.


Figure  2:  One  dimensional  SAB-tree,  attempting  to  learn 
a  sine  wave  from  noisy  exemplars,  shown  as  white  dots. 
Black  dots  are  the  smoothed  interpretation.  Top  left:  After 
8 random exemplars. Top right: After 50. Bottom left:
The underlying function is changed. Bottom right: After
a further 50 exemplars.


Aim  1  represents  doing  the  best  for  the  system  im¬ 
mediately  and  aim  2  represents  an  investment  in  the 
future.  One  approach  to  this  dilemma  is  to  have  two 
modes  of  operation.  The  first  would  be  experimental, 
where  actions  are  chosen  randomly  with  no  reference 
to previous experience. The second is demonstration
mode, in which no chances are taken, and the best
known  action  is  always  used.  These  modes  are  altered 
by  an  intelligent  supervisor  who  can  judge  when  fur¬ 
ther  experimentation  is  needed  and  when  it  is  posi¬ 
tively  unwelcome.  An  effective  example  of  these  two 
modes  of  operation  is  Mel’s  MURPHY  [Mel,  1989]. 
This  learns  the  inverse  differential  kinematics  by  firstly 
“flailing”  the  arm  in  front  of  its  camera  and  then  mak¬ 
ing  plans  based  on  the  results  of  the  observations. 

Intermediate  schemes  are  possible  in  which  limited 
experimentation  is  permitted  by  means  of  limited  ran¬ 
dom  perturbations  of  the  best  known  action,  again 
with  experimentation  under  the  control  of  a  supervi¬ 
sor. 

For  a  dynamic  system  with  a  large  state  space  the 
approach  of  having  a  controlled  experimentation  level 
has  some  drawbacks.  Aside  from  the  requirement  for 
an  intelligent  supervisor,  the  major  problem  is  that 
experimentation  is  undirected.  With  a  state  space  of 
non-trivial  dimension,  much  of  the  data  collected  from 
experiments  will  be  irrelevant  to  the  task,  leading  to 
an  unnecessarily  low  learning  rate. 

In this section I present an action choosing mechanism
which is able to take advantage of the explicit
exemplar representation to decide, independently of
supervision, whether an experimental action is warranted.
Given a set of candidate experiments, it also
provides an estimate of which is the most promising.

An initial possible action a_1 can be obtained using
the partial inversion mechanism described in Section 2.1.
The behaviour of a_1 is estimated as that of
the closest exemplar to (s_q, a_1). If this predicted behaviour
is within r of b_q, where r is a task-specific
tolerance¹, then a_1 is chosen.

If a_1 is inadequate, then it is necessary to experiment.
Instead of making an entirely random guess out
of the space of possible actions, the controller makes
an educated guess. It generates several alternative
random actions, and chooses the one with the highest
estimated probability of success. This estimate is
obtained by the following heuristic:

$$\mathrm{ProbSuccess}(a) = \mathrm{Prob}\big(f(s_q, a) \text{ is within } r \text{ of } b_q\big) \qquad (5)$$

The behaviour at the unknown point (s_q, a) is modelled
as a gaussian random variable with properties
depending on the nearest known exemplar
(s_near, a_near, b_near, b_near^smooth). The expected behaviour is
b_near^smooth. The standard deviation is proportional to the
distance to (s_near, a_near). This models the increasing
uncertainty as we move further from a known exemplar.
The constant of proportionality C reflects how
smooth the designer imagines the PSTF function will
be.


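$$\mathrm{ProbSuccess}(a) = \int_{b_q - r}^{b_q + r} \frac{1}{\sigma \sqrt{2\pi}} \exp\left( - \frac{(x - b^{\mathrm{smooth}}_{\mathrm{near}})^2}{2 \sigma^2} \right) dx \qquad (6)$$

where σ = C |(s_near, a_near) − (s_q, a)|. The heuristic
is not brittle to the choice of C: the ranking between
good and bad candidate actions is changed very little
as C varies within a factor of 100.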

To illustrate the use of the heuristic, imagine that
S, A and B are all 1-d spaces, and that we are in state
s_q = 3.3 and the desired b_q is 11. Assume further
that the only previous experience is that when s = 3
and a = 7 then b = 10. The use of the heuristic,
with a tolerance of 0.5, is demonstrated in Figure 3. It
compares three possible actions and chooses one which
is fairly near to 7, so that the behaviour will be fairly
near to 10 (and thus might be close to the goal, 11), but
not so near to 7 that the behaviour will be extremely
close to 10.
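
The one-dimensional illustration can be reproduced numerically. The sketch below scores three candidate actions under the Gaussian model described above; the candidate values 7, 8 and 15, the choice C = 1, and the use of raw (unscaled) coordinates are assumptions made only for this example.

```python
import math

def normal_cdf(x, mean, sigma):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def prob_success(a, s_q, b_q, near, r, C=1.0):
    """Probability that the behaviour falls within r of b_q (1-d case).

    near = (s_near, a_near, b_smooth_near) is the nearest known exemplar.
    """
    s_n, a_n, b_n = near
    sigma = C * math.hypot(s_n - s_q, a_n - a)
    if sigma == 0.0:   # the query coincides with the exemplar itself
        return 1.0 if abs(b_n - b_q) <= r else 0.0
    return normal_cdf(b_q + r, b_n, sigma) - normal_cdf(b_q - r, b_n, sigma)

near = (3.0, 7.0, 10.0)            # s = 3, a = 7 gave b = 10
candidates = [7.0, 8.0, 15.0]      # very close, fairly close, very different
scores = {a: prob_success(a, 3.3, 11.0, near, r=0.5) for a in candidates}
best = max(scores, key=scores.get)
```

Under these assumptions the middle candidate scores highest: the nearest action is too likely to repeat the known behaviour, while the distant action spreads its probability too thinly.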

Initially it is likely that for many states the only previous
experience will be negative. Then this heuristic
favours those actions which are as far away as possible
from any yet applied. This is because the probability
of success for actions which are close to a previous
unsuccessful action is extremely low, whereas actions which are
far away have a probability of success which is merely
low. Thus a wide variety of actions will be generated
until some relatively promising ones are discovered².

¹r is not an implicit experimentation level. For example,
if high accuracy is required a high tolerance can be
specified right from the start.

²Notice, though, that if the promising actions were misleading,
for example in a local minimum, then upon repeated
trials, the heuristic would eventually once again
favour further points, because all experiments in the local
vicinity would be very close to actions which had failed to
meet the tolerance.

Figure 3: Choosing an action which is likely to cause behaviour
in the range 10.5 to 11.5. Top: Action very close
to earlier exemplar in which behaviour was 10. Middle:
Action fairly close. Bottom: Action very different. The
middle case is most promising because the top action is
very likely to have behaviour close to 10, and the bottom
action could have almost any behaviour.

I have defined a measure for scoring candidate actions,
but have not provided a method for obtaining
that which has the highest score. This would require a
search of all possible actions, which real time response
does not permit. Instead, the favourite of a sample
of randomly generated actions is used. A large sample
can be expected to provide an action close to that
which would be recommended by exhaustive search,
but with the penalty of more computation per action.
It should be noted that once a successful action has
been discovered, computation becomes trivial, as this
will be the original action, a_1, returned by partial inversion.

Even if the action is only ever chosen from a small
number of candidates, if one performs a series of trials
all with the same s_q and b_q, then (subject to being
attainable) a_q will eventually converge to a value which
achieves the tolerance. However, the state, s_q, of a
dynamic manipulator cannot be expected to remain
constant. The failure or success of the control choice
of the previous time step may no longer be relevant.
It is a result of the explicit storage of the exemplars
that the performance can nevertheless be expected to
improve because information from all those occasions
in the learning history which are currently relevant will
still be available.

4  Proximal  Control

I have explained how the combined perspective, kinematics
and dynamics models and their inverses can
be learned, and how the sab-tree performance element
is used to choose actions given the current state and
desired behaviour. Here, I explain how a high level
controller for a particular task can make use of these
facilities.

Task  specifications  and  also  plans  to  achieve  these 
specifications  can  take  place  in  the  abstract  domain  of 
what  the  robot  perceives.  It  is  not  necessary  to  reason 
in  the  concrete  domain  of  joint  angles,  joint  torques 
and  feedback  gains.  Thus,  for  example,  the  output  of 
a  plan  to  move  the  end-point  towards  an  observed  goal 
can simply be that the observed end-point position
should  start  to  move  towards  the  observed  goal. 

Because the underlying model will adapt to changes
in the environment, and because the design of the controller
is now a fairly simple task (cf. examples in Section 5),
there is less motivation to make the high level
controller learn. Instead, at each control cycle, it specifies
the acceleration which it would like to be applied
to the end-point. Each component of the acceleration
vector can be treated separately. Writing x_i(t) as the
position of the ith proximal state variable at time t,
v_i(t) as its velocity and a_i(t) as its acceleration gives
the trivial control equations:

$$v_i(t_1) = v_i(t_0) + \int_{t_0}^{t_1} a_i(t)\,dt \qquad (7)$$

$$x_i(t_1) = x_i(t_0) + \int_{t_0}^{t_1} v_i(t)\,dt \qquad (8)$$

Examples  of  this  sort  of  controller  are  given  in  the 
following  section. 

5  Experimental  Results 

These experiments use a dynamic simulation of a planar
two jointed robot arm moving under gravity. The
dynamic model is from [Fu et al., 1987].

5.1  Tracking  a  moving  point 

The  task  is  to  track  a  point  moving  at  constant  speed 
along  an  anticlockwise  circle  in  proximal  coordinates. 
Figure  4  shows  the  initial  state  of  the  arm,  and  the 
target  trajectory. 

Figure 4: Two jointed planar arm acting under gravity. Its
task is to follow the circular trajectory.

At each time step we can observe the position and
velocity of the image of the arm's endpoint. From the
task specification we can obtain the ideal image position
in two time steps. We consider each variable
separately. Let x_0 be the perceived current position of
one such variable, and v_0 its perceived current velocity.
If we apply constant acceleration a_0 for one time
step, followed by acceleration a_1 for the next, then
from Equation 7 we obtain the perceived position and
velocity in two time steps:

$$x_2 = x_0 + 2 v_0 + \tfrac{1}{2}(3 a_0 + a_1)$$
$$v_2 = v_0 + a_0 + a_1 \qquad (9)$$

We insert the ideal x_2 and v_2 to obtain the ideal
acceleration:

$$a_0 = x_2 - x_0 - \tfrac{1}{2}(3 v_0 + v_2) \qquad (10)$$
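
As a check on Equations 9 and 10, the following sketch computes the first-step acceleration for one variable and confirms by direct simulation that the target position and velocity are reached after two unit time steps. The numerical values and function names are arbitrary assumptions chosen for illustration.

```python
def ideal_acceleration(x0, v0, x2, v2):
    """Equation 10: first-step acceleration, assuming unit time steps."""
    return x2 - x0 - 0.5 * (3.0 * v0 + v2)

def two_steps(x0, v0, a0, a1):
    """Equation 9 by direct simulation: constant acceleration over each step."""
    x1, v1 = x0 + v0 + 0.5 * a0, v0 + a0
    return x1 + v1 + 0.5 * a1, v1 + a1

# Hypothetical current state and two-step target for one proximal variable.
x0, v0, x2_goal, v2_goal = 1.0, 0.5, 2.0, -0.25
a0 = ideal_acceleration(x0, v0, x2_goal, v2_goal)
a1 = v2_goal - v0 - a0        # from the velocity line of Equation 9
x2, v2 = two_steps(x0, v0, a0, a1)
```

Only a_0 is actually requested from the arm; a_1 is recomputed on the next cycle from fresh observations.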


The ideal acceleration is computed for each variable
independently. The proximal acceleration vector is the
behaviour b_q which is passed, along with the current
proximal state s_q, to the sab control choice mechanism
of Section 3. The resulting a_q is a torque vector
which, it is believed, will produce the proximal acceleration.
If these torques are too large, the largest torque
in the same direction is applied. After these torques
have been applied the actual proximal acceleration is
observed. A new exemplar consisting of the previous
state, the torques and the actual acceleration is added
to the sab-tree.


Figure 5 is of the first nine journeys around the circle.
At the start of the first journey there is no knowledge
of the PSTF. It does not take long
before some very rough control is established, tending
at least to provide thrusts in the correct proximal direction,
if not accurately. Within only a short
time the arm in fact remains very close to the target point.

Figure  6  is  of  the  first  nine  journeys  around  the  cir¬ 
cle  with  a  substantial  noise  component  added  to  the 
arm  dynamics.  The  rate  of  learning  is  slower  because 
the  initial  exemplars  are  misleading  but  eventually 
the  exemplars  become  sufficiently  numerous  that  the 
smoothed  values  represent  the  ideal  behaviour.  The 
performance will never reach that of the original example,
because random noise is always being added.

Figure 7 demonstrates how the controller adapts
when the arm dynamics change suddenly during the
fourth circuit³. Initially the performance is disastrous,
but it gradually recovers. It takes several cycles to recover
because the disasters mean that it doesn't immediately
re-experience all the areas near the trajectory.

Each experiment took less than 10 minutes real time
for the simulation and learning.

³The actuator at joint two suddenly starts providing
twice the original torque.

Figure 5: Trajectory of arm during first nine circuits, read
from left to right down the figure. The hand was initially
near the centre of the circle.

Figure 6: Trajectory of arm with substantial random noise
added to the manipulator dynamics.

Figure 7: Trajectory of arm. Manipulator dynamics change
during the fourth circuit. By the eleventh circuit (not
shown) the control was entirely recovered.

Figure 8: Structuring the Volley task. The tasks are shown
in rectangles, sab-trees are shown in ovals.

5.2  Underarm  Volleying:  A  Compound  Task 

For  this  task,  the  simulated  arm  is  given  a  bat  which 
is  fixed  at  right  angles  to  the  arm’s  second  linkage.  A 
ball  is  fired  towards  the  arm.  The  task  of  the  arm  is 
to  volley  the  ball  so  that  it  lands  in  a  bucket  placed 
nearby.  The  visual  location  of  the  bucket  and  ball  can 
be  obtained. 

This  complex  task  can  be  structured  into  the  hier¬ 
archy  of  subtasks  shown  in  Figure  8. 


The  overall  task,  called  Volley,  is  broken  into  three 
subtasks.  Predict  estimates  the  ball’s  behaviour. 
Strike  brings  the  bat  to  contact  the  ball  at  a  con¬ 
trolled  position  and  speed.  Return  holds  the  bat 
steady  after  the  strike. 

The Predict task models the perceived behaviour
of the ball prior to being hit. This model is learned
by a sab-tree, which estimates the time until the ball
arrives within range of the arm and the state of the
ball at this point. It is a mapping
which is updated once every trial. There is no control
over this aspect of the ball's behaviour so A, the space
of control actions, is empty.

The  Return  task  simply  computes  a  proximal  accel¬ 
eration  to  thrust  the  endpoint  towards  the  stationary 
waiting  position.  The  torques  to  achieve  this  acceler¬ 
ation  are  computed  and  executed  by  the  low  level  Acc 
task. 

The  Strike  task  controls  the  ball  indirectly  by 
means  of  a  collision  between  the  ball  and  the  bat.  It 
requires  a  model  of  the  real  world: 


Ball at Hit × Bat position and direction ×


This too can be learned using a sab-tree. From trial
to trial, the Strike task attempts to always position
the bat to contact the oncoming ball at the bat's centre,
and to have it moving in the same relative direction
at impact. The speed of the bat at impact
is varied. This speed affects the landing position of
the ball, and so can be used to indirectly control the
landing position. At the trial start the sab-search is
given s_q as the estimated ball state when it arrives
in range and the intended bat position and direction.
The search produces a recommended speed for the bat
at impact.

It should be noted that this sab-tree, like the others,
is simply a set of objective observations about the
world, and its own accuracy does not depend on the
performance of the subtasks which are being learned.
There is no blame or credit assignment problem. For
example, suppose that we believed that hitting speed
S_1 at position P_1 would ensure that the ball landed at
X_1, but due to a low level error the ball was volleyed
with the correct speed S_1, but at the wrong position
P_2. The ball then lands at X_2. The speed S_1 will not
now be wrongly associated with landing at X_2: the
sab-tree will simply contain an observation that hitting
with speed S_1 at the wrong position, P_2, results
in a landing at X_2, and will contain no explicit prediction
as to what would happen were the ball hit at the
correct position.


The Meet task guides the initial proximal state
to the target impact state. Working in the proximal
space, it can do this for each state variable independently.
The ith variable has a current state
(x_i(0), v_i(0)) and the controller must invent a sequence
of accelerations so that at the predicted time t_goal of
the ball's arrival the state will be (x_goal, v_goal). In fact
this is easy. Consider accelerating with acceleration
a_0 for t_1 seconds, then with acceleration −a_0 for t_2
seconds. Then there are three equations in three unknowns:
Equations 7 and 8 with constant acceleration
and the constraint that t_1 + t_2 = t_goal. The unknowns
are a_0, t_1 and t_2. Finding the value of a_0 to achieve
this is straightforward scalar algebra, which amounts
to solving a quadratic equation. The ideal proximal
acceleration can thus be recalculated on every time
step to account for earlier errors.
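
One way to carry out this algebra is sketched below. Eliminating t_1 and t_2 from the three equations gives the quadratic t_goal² a_0² − 4 D a_0 − Δv² = 0, with Δv = v_goal − v(0) and D = x_goal − x(0) − ½ (v(0) + v_goal) t_goal. This derivation, the symbols D and Δv, and the rule for choosing the root are my own working under the stated accelerate-then-decelerate profile, not taken from the original.

```python
import math

def meet_plan(x0, v0, x_goal, v_goal, t_goal):
    """Return (a0, t1, t2): accelerate a0 for t1, then -a0 for t2 = t_goal - t1."""
    dv = v_goal - v0
    D = (x_goal - x0) - 0.5 * (v0 + v_goal) * t_goal
    # Root of  t_goal^2 a0^2 - 4 D a0 - dv^2 = 0  with |a0| >= |dv| / t_goal,
    # which is the one that keeps 0 <= t1 <= t_goal.
    root = math.sqrt(4.0 * D * D + (t_goal * dv) ** 2)
    a0 = (2.0 * D + math.copysign(root, D)) / t_goal ** 2
    if a0 == 0.0:   # already on a constant-velocity course to the goal
        return 0.0, 0.5 * t_goal, 0.5 * t_goal
    t1 = 0.5 * (t_goal + dv / a0)
    return a0, t1, t_goal - t1

def simulate(x0, v0, a0, t1, t2):
    """Final (position, velocity) after the two constant-acceleration phases."""
    x1 = x0 + v0 * t1 + 0.5 * a0 * t1 ** 2
    v1 = v0 + a0 * t1
    return x1 + v1 * t2 - 0.5 * a0 * t2 ** 2, v1 - a0 * t2

# Hypothetical state and goal for one proximal variable.
a0, t1, t2 = meet_plan(0.2, -0.3, 0.7, 0.4, 1.5)
x_end, v_end = simulate(0.2, -0.3, a0, t1, t2)
```

Because the plan is recomputed every time step from the observed state, only the current value of a_0 is ever passed on as the requested proximal acceleration.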


Figure  10:  A  successful  volley.  As  well  as  modelling  its 
own  arm,  the  controller  has  correctly  predicted  where  the 
ball  will  be  when  it  is  in  range  and  found  a  correct  speed 
with  which  to  hit  the  ball  back  to  the  bucket. 

The  Acc  subtask  is  the  task  used  in  the  circle  tracing 
experiments  to  derive  a  torque  from  the  PSTF  which 
achieves  the  requested  proximal  acceleration. 

Figures  9  and  10  show  the  behaviour  of  the  arm 
during  an  early  and  late  trial  respectively. 
Figure 11 graphs the performance of a relatively easy
task,  where  the  ball  is  always  fired  from  the  same  posi¬ 
tion  and  velocity  and  the  bucket  always  remains  in  the 
same  place.  After  five  trials  a  suitable  hitting  speed  is 
discovered,  and  a  value  close  to  this  speed  is  used  for 
subsequent  trials. 




Figure  11:  Histogram  of  distance  from  bucket  against 
trial  number.  The  successful  volleys,  which  landed  in  the 
bucket,  are  shown  in  white.  During  these  trials  the  bucket 
was  fixed  and  the  ball  always  fired  with  the  same  speed 
and  direction. 

Figure  12  displays  the  results  of  the  considerably 
harder  task  in  which  the  ball  is  fired  at  a  random  speed 
for  each  trial,  and  the  bucket  is  placed  in  a  random  po¬ 
sition.  It  requires  approximately  twenty  trials  before 
the  behaviour  can  be  said  to  be  fairly  skilled. 

Figure  13  shows  the  behaviour  when  the  bucket  is 
placed  randomly  and  the  ball  is  fired  with  a  random 
speed  and  direction.  There  is  an  improvement  in  be¬ 
haviour  but  the  probability  of  failure  is  still  roughly 
20%  even  after  over  100  trials.  This  is  because  even 
then  there  is  still  a  fair  probability  that  the  starting 
state  of  the  ball  is  sufficiently  far  from  any  previous  ex¬ 
perience  that  the  behaviour  predicted  by  the  nearest 
neighbour  is  inadequate. 

6  Conclusion 

The experiments in the previous section all took only
a small amount of real time. Both the rate of learning
and the information processing were fast. For example,
in the circle tracking task, after only twenty
state-action-behaviour observations, the arm was already
under some sort of rough control⁴. The primary reason
for this is that only one presentation of a sample of
data is required for the knowledge contained in the
data to be fully represented. This can be contrasted
to an approach in which the knowledge is represented
indirectly, perhaps as a set of weights in a network.
When each exemplar is presented, all or some of the
weights are modified according to a rule which will only
eventually, on repeated future presentations of similar
exemplars, converge to a representation of the same
knowledge.

⁴The direction in which the hand was accelerated was
already on average within 30 degrees of the direction requested
by the proximal controller.

Figure 12: Histogram of distance from bucket against trial
number. Before each trial the bucket was placed at a random
position. The ball was always fired with random speed
but constant direction.

Figure 13: Histogram of distance from bucket against trial
number. Before each trial the bucket was placed at a random
position. The ball was always fired with random speed
and random direction.

A  second  reason  for  the  fast  rate  of  learning  is 
that  the  mappings  which  are  learned  are  objective 
observations  about  the  real  world,  rather  than  task- 
specific  knowledge  of  which  responses  are  “right”  or 
“wrong”. This means that the knowledge obtained
from  a  “wrong”  response  in  one  context,  might  in  an¬ 
other  context  or  task  be  useful  positive  information. 
For  example,  when  an  erroneous  volley  misses  the  tar¬ 
get,  the  controller  has  the  consolation  that  should  the 
target,  in  a  similar  situation,  ever  be  near  the  point 
at  which  the  ball  in  fact  landed,  it  will  know  about  a 
potentially  successful  volley. 
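The reuse of a “wrong” response can be illustrated with a minimal sketch. It assumes a plain list of exemplars searched linearly in place of the tree, and the class name `ExemplarMemory`, its methods, and the numeric values are all hypothetical, chosen only for illustration.

```python
import math


class ExemplarMemory:
    """Stores (state, action, behaviour) observations.

    A linear-search stand-in for the tree described in the text.
    """

    def __init__(self):
        self.exemplars = []

    def observe(self, state, action, behaviour):
        # Every observation is stored as an objective fact, whether or not
        # it achieved the goal that was being pursued at the time.
        self.exemplars.append((tuple(state), tuple(action), tuple(behaviour)))

    def action_for(self, state, desired_behaviour):
        """Return the action of the exemplar nearest to (state, desired behaviour)."""
        query = tuple(state) + tuple(desired_behaviour)
        best = min(self.exemplars, key=lambda e: math.dist(e[0] + e[2], query))
        return best[1]


memory = ExemplarMemory()
# A volley aimed at a target at x = 5.0 lands at x = 3.0: a "wrong" response.
memory.observe(state=(0.0,), action=(0.4,), behaviour=(3.0,))
memory.observe(state=(0.0,), action=(0.9,), behaviour=(7.5,))
# Later a target appears near x = 3.0 in a similar state, and the earlier
# miss becomes the best available suggestion.
suggested = memory.action_for(state=(0.0,), desired_behaviour=(3.1,))
```

The stored miss is never marked as an error; it is simply retrieved when a later query lies close to what actually happened.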

The performance element, the sab-tree, was updated and interrogated
in real time. The small overall times for the experiments indicate
that the speed of nearest neighbour searching is adequate. The
exemplars obtained when performing a task tended to be distributed
extremely unevenly, which helps the nearest neighbour search
considerably. With the decreasing cost of memory and relatively
fast processors it should be entirely practical to search trees in
the order of a million exemplars. This is the size of a tree
obtained from monitoring a robot constantly for over eleven days
with one observation a second. Furthermore, it is likely that the
size of the trees could be reduced by an order of magnitude by only
storing those exemplars which were predicted incorrectly prior to
their observation, but this has not been investigated.
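The eleven-day figure and the proposed storage reduction can be checked with a short sketch. The selective-storage rule below is only one possible reading of “predicted incorrectly”. The one-dimensional data, the tolerance, and the function names are assumptions, not part of the reported experiments.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_collect(n_exemplars, observations_per_second=1.0):
    """Time needed to gather n_exemplars at a fixed observation rate."""
    return n_exemplars / observations_per_second / SECONDS_PER_DAY


def store_if_surprising(memory, state_action, behaviour, tolerance):
    """Append the exemplar only if the nearest stored one mispredicts it."""
    if memory:
        nearest = min(memory, key=lambda e: abs(e[0] - state_action))
        if abs(nearest[1] - behaviour) <= tolerance:
            return False  # predicted well enough, so nothing is stored
    memory.append((state_action, behaviour))
    return True


days = days_to_collect(1_000_000)  # a million exemplars at one per second

# Hypothetical smooth behaviour: behaviour = 2 * state_action.
stream = [(x / 10.0, 2.0 * (x / 10.0)) for x in range(100)]
memory = []
kept = sum(store_if_surprising(memory, sa, b, tolerance=0.5) for sa, b in stream)
```

How much is saved depends entirely on the tolerance and on how smooth the observed mapping is, so the order-of-magnitude estimate in the text remains a conjecture.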

As a result of being able to reason in proximal space, the low
level control can become straightforward. The behaviour of the
controller becomes easier to understand when tasks are specified in
relatively abstract terms such as “move up” or “accelerate towards
this point”, instead of concrete joint space terms. Just as a
compiler allows a programmer to think in abstract terms without
needing to know about the underlying machine language, so sab-tree
learning permits the system designer to devise control strategies
without needing to know about the specific parameters, geometry or
dynamic behaviour of the robot, even if these may change over time.

By structuring a more complex task as a hierarchy of subtasks, the
higher levels of the controller can also become relatively easy to
implement. When models of the real world are required for these
higher tasks, sab-trees can once again be used. However, the
structuring needs to be performed by some expert, presumably human.
In order to be able to classify the controller as truly autonomous,
it would have to be able to devise the structuring strategy itself.
At this stage the lack of full autonomy is a compromise, justified
by the observation that a large proportion of the effort in robotic
control using conventional methods is not in inventing abstract
strategies, but in modelling the world.
Abstract

We study the feasibility of adaptive pattern recognition of robotic
tactile impressions using connectionist models. This paper presents
interim simulation results of coupled back-error propagation (BEP)
networks that (i) extract relative gradient features via data
compression, (ii) cluster families of grey-scale patterns
constrained by geometry, size, and activation levels, and (iii)
classify these surface profiles into pre-specified categories. The
constraints imposed on the training data are designed to capture
the essence of tactile patterns and force the artificial neural
systems (ANS) to extract useful features. Receptive field (rather
than fully connected) processing units are used to encode subtle
features among their activation patterns. This work initiates ANS
applications in the tactile domain and reveals basic
characteristics of the response of BEP networks to highly
constrained training data.

1  Introduction 

Numerous techniques are well known for pattern classification given
an appropriate set of input features, but the determination of such
features remains the central challenge in many pattern recognition
processes. Efficient solutions for automatic (machine) derivation
of these features have been elusive. We explore the possible
benefits of using artificial neural systems (ANS) for feature
extraction, clustering, and classification of relative gradients
within 2D tactile impressions.

In robotic applications, tactile sensing is critical when a machine
is required to perform dexterous manipulations and precise/fragile
assembly tasks. Also, for object detection or identification
purposes, tactile sensors can complement or substitute for more
costly vision sensors (whenever contact can be made).

Presently, a variety of tactile sensor technologies exist:
conductive elastomers, ferroelectric polymers, optoelectronic
sensors, and silicon strain gauges [Nicholls and Lee, 1989].
Reliable sensor data can be obtained from some of the
industrial-quality products. Sensors vary in size, shape, and
resolution for mounting on the “fingers” of a robot or atop a
work-surface.

Research and development in tactile sensing comprises: (i)
biological/physiological studies, (ii) design of artificial
sensors, (iii) planning and control of haptic perception to acquire
object data, and (iv) pattern recognition of the tactile data. The
work described below is restricted to topic (iv): processing and
classification of tactile impressions. Past work in this area has
primarily been limited to applying “borrowed” algorithms from image
processing to the tactile domain. These methods work well for
classification of simple, binary silhouettes, but are inadequate
for real-time applications in more complex domains. We begin by
investigating the domain of a planar sensor matrix and techniques
for processing its data in the connectionist realm.

2  The  Training  Data 

True tactile patterns comprise both the geometric shape of the
object and the surface contours derived from 3D force and torque
distributions on the sensor. Although specific sensor designs may
vary, each ‘tacel’ provides at least the normal (z) force reading,
and some advanced designs yield all 3D force components. The force
experienced by each tacel is converted into a grey-scale number,
and the collection of values from all tacels forms the “image” to
be analyzed. Embedded information from such data, possibly combined
with the global 3D moments experienced by the sensor, must be
manipulated to derive a “sense of touch” in robotic applications.

We begin with a small training set based on nine surface contours
depicted in Figure 1. When pressed onto a compliant tactile sensor,
these rigid surface profiles would produce the corresponding
impressions shown in Figure 2. There are six surface profiles of
rectangular silhouette and three of circular silhouette, named and
tagged as: bar(B), rod(R), wedge(W), slant(L), bump(U), hole(H);
and sphere(S), cylinder(C), cir_slant(I) respectively.


Figure  1:  Surface  Primitives 
Figure 2: Corresponding Tactile Impressions

The sensor response reflects Lord Corporation's LTS200 model
tactile sensor [Rebman and Trull, 1983; Muthurkishnan et al.,
1987], but similar data can be obtained from other high-quality,
robust sensors. The impressions in Figure 2 assume uniform pressure
distribution on the solid objects, and that the entire bottom
surface of the object is in contact with the sensor. The squares
are graphical icons representing tacels, and their sizes are
directly proportional to the amount of normal force experienced at
each site.


Within each data group (rectangular or circular) the size and shape
of the contact surfaces are identical, so that the ANS must
discriminate the patterns based on the intensities of the force
experienced by the tacels rather than the object's size or
geometry. Moreover, the ‘global’ or ensemble force experienced by
the matrix of tacels is held approximately constant in order to
force the ANS to distinguish the relative gradient patterns rather
than the overall pressure on the sensor. These constraints are
important to ensure extraction of useful features, but do not imply
that all tactile data embody such characteristics; we study the
more restrictive and difficult case and contend that relaxing any
one of the constraints (e.g. different geometries) would pose a
simpler feature extraction problem. Previous works in tactile
pattern recognition described processing of binary patterns and/or
differing geometries, which do not focus on and capture the essence
of tactile data [Hering et al., 1990] [Muthurkishnan et al., 1987]
[Marik, 1981] [Kadonoff, 1983] [Togai, 1982] [Hillis, 1982]
[Sato et al., 1977] [Takeda, 1974] [Kinoshita, 1973].
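The ensemble-force constraint can be sketched as a simple rescaling step. This is an illustrative assumption about how such a training set might be prepared, not a description of how the patterns in Figure 2 were generated; the 3x3 images and the target total are hypothetical.

```python
def total_force(image):
    """Ensemble force: the sum of grey-scale values over all tacels."""
    return sum(sum(row) for row in image)


def rescale_to_total(image, target_total):
    """Scale every tacel by one factor so the ensemble force hits the target.

    A single common factor leaves the footprint and the relative gradients
    (ratios between tacels) unchanged.
    """
    scale = target_total / total_force(image)
    return [[value * scale for value in row] for row in image]


# Two hypothetical impressions with the same footprint but different profiles.
flat = [[4, 4, 4], [4, 4, 4], [4, 4, 4]]
ramp = [[1, 3, 5], [1, 3, 5], [1, 3, 5]]

target = 36.0
flat_n = rescale_to_total(flat, target)
ramp_n = rescale_to_total(ramp, target)
```

After rescaling, the two patterns differ only in how the force is distributed, which is the distinction the network is meant to learn.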

3  Network  Description 

In this study, the domain of ANS is a controlled environment such
as a robot workcell in which several regular objects are assembled
with the aid of a planar tactile sensor. Thus, the ANS are not
required to create categories from random, arbitrary input
patterns; in fact, during training, the desired output categories
would be specified and associated with a subsequent task. As a
result, a more direct architecture of a layered feed-forward
back-error-propagation (BEP) network is chosen rather than one
based on Adaptive Resonance Theory. The ‘standard’ BEP network is a
nonparametric classifier, and its algorithm is based on minimizing
an error function $E_T(W) = \sum_p E_p(W)$ via recursive
computation of the error signal $\delta$ with gradient descent in
weight-space [Rumelhart et al., 1986]. The simulations were
performed using the Rochester Connectionist Simulator [Goddard et
al., 1989].
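For reference, the recursive computation of the error signal can be written out in the usual form of the generalized delta rule from [Rumelhart et al., 1986]. The notation below (targets $t_{pj}$, outputs $o_{pj}$, learning rate $\eta$, activation function $f$) is the conventional one and is assumed here rather than taken from this paper.

```latex
% Total error over all training patterns p, and the per-pattern error
E_T(W) = \sum_p E_p(W), \qquad
E_p = \tfrac{1}{2} \sum_j \left( t_{pj} - o_{pj} \right)^2

% Gradient-descent weight change for the link from unit i to unit j
\Delta_p w_{ji} = \eta \, \delta_{pj} \, o_{pi}

% Error signal for an output unit j
\delta_{pj} = \left( t_{pj} - o_{pj} \right) f'(\mathrm{net}_{pj})

% Error signal for a hidden unit j, computed recursively from the units k it feeds
\delta_{pj} = f'(\mathrm{net}_{pj}) \sum_k \delta_{pk} \, w_{kj}
```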

We  first  study  the  problem  of  processing  static  tac¬ 
tile  impressions  from  one  planar  sensor  matrix,  and 
at  this  juncture,  assume  that  object  impressions  are 
totally  contained  within  the  sensor.  It  is  desirable  to 
achieve  pattern  recognition  invariant  to  translations 
and  rotations  of  the  tactile  impressions.  (An  object 
of  different  scale  could  effectively  be  of  another  class, 
since  its  handling  and  subsequent  operations  may  dif¬ 
fer.)  In  this  initial  phase,  we  impose  only  the  trans¬ 
lation  invariance  requirement  and  use  a  2D  Fourier 
transform  as  a  preprocessor.  The  network  architec¬ 
ture  is  shown  in  Figure  3. 

Figure 3: Network Architecture (labels recoverable from the
figure: Classifier, Teach)

The first BEP module extracts salient features among the training
patterns and the second module forms decision boundaries between
them for categorization. The ovals in the Fourier spectrum indicate
that receptive field units are used rather than the standard fully
connected BEP scheme. The input to the feature extraction network
is quadrant one of the corresponding object's Fourier spectrum, as
shown below.
Figure 4: Input Vectors: Partial Fourier Spectra

The spectra of the rectangular patterns (top two rows in Figure 4)
and the circular patterns are distinctive as a group, indicating
that differences in geometry pose a relatively easy classification
problem. Within each group, however, the distinguishing features
are more subtle and pose a challenging learning problem for the
ANS. Since the spectra of high frequency terms decrease rapidly,
the function $D(u, v) = \log(1 + |F(u, v)|)$ is used instead of
$|F(u, v)|$ to display the image and preserve the zero values in
the frequency plane. The FFT output yields translation invariance,
since its magnitude $|F(u, v)|$ is not affected by a shift in the
time domain.
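The invariance of the Fourier magnitude under translation can be checked numerically. The sketch below uses a naive 2D discrete Fourier transform instead of an FFT, a hypothetical 4x4 impression, and a circular shift, which matches an ordinary translation as long as the impression stays inside the sensor.

```python
import cmath
import math


def dft2(image):
    """Naive 2D discrete Fourier transform of a list-of-lists image."""
    rows, cols = len(image), len(image[0])
    return [[sum(image[x][y]
                 * cmath.exp(-2j * math.pi * (u * x / rows + v * y / cols))
                 for x in range(rows) for y in range(cols))
             for v in range(cols)]
            for u in range(rows)]


def magnitude(spectrum):
    """|F(u, v)| for every frequency term."""
    return [[abs(value) for value in row] for row in spectrum]


def log_display(spectrum):
    """D(u, v) = log(1 + |F(u, v)|); zero terms stay exactly zero."""
    return [[math.log(1.0 + abs(value)) for value in row] for row in spectrum]


def shift(image, dx, dy):
    """Circularly translate the image by dx rows and dy columns."""
    rows, cols = len(image), len(image[0])
    return [[image[(x - dx) % rows][(y - dy) % cols] for y in range(cols)]
            for x in range(rows)]


image = [[0, 0, 0, 0],
         [0, 1, 2, 0],
         [0, 3, 4, 0],
         [0, 0, 0, 0]]
moved = shift(image, 1, 1)

original_mag = magnitude(dft2(image))
moved_mag = magnitude(dft2(moved))
max_difference = max(abs(a - b)
                     for row_a, row_b in zip(original_mag, moved_mag)
                     for a, b in zip(row_a, row_b))
```

The shift changes only the phase of each term, so `max_difference` stays at the level of floating-point rounding.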

The hidden units in the feature extraction module each receive
input from a small neighborhood of input units. The exact number of
hidden units would depend on the size of the input vector and the
receptive fields; we find that the more similar the input vectors
are, the smaller the receptive fields required to encode minor
feature differences, and thus the more hidden units would be
created. The connectivity issue is discussed further in Section
4.2. The output units receive the same information and connectivity
pattern as the input, since the feature extraction BEP network
receives no external teach vector.

4  Feature  Extraction 

Saund [Saund, 1986] presented some theoretical background for using
connectionist networks to discover constraints in multidimensional
data via dimensional reduction. A subset of m-dimensional features
embedded in an n-dimensional data source can be abstracted into
hidden units of a BEP network. Cottrell et al. [Cottrell et al.,
1987] and Kuczewski et al. [Kuczewski et al., 1987] reported on
using a BEP network as a self-organization structure in image
compression/data reduction. The feature extraction module in Figure
3 is based on these same principles, and background concepts are
omitted here, except to reiterate that no explicit external
“teacher” is used in this scheme; the network maps input patterns
onto themselves while performing data compression. In the initial
simulations, the feature extractor module compacted salient
features among ten 64-dimensional input vectors into 8 feature
(hidden) units. Kuczewski demonstrated a more significant reduction
ratio, from 255 to 3 dimensions in a four-layer network; here, the
specific constraints of the input data in the tactile domain are of
interest to us. Discrimination of tactile impressions poses an
interesting task for the ANS, and empirical results indicate that
(i) ANS abstract the simplest (but not necessarily useful) criteria
while encoding features, unless proper constraints are imposed on
the input data, and (ii) the standard fully-connected paradigms of
the BEP architecture may not necessarily yield useful data
representation.
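The idea of a network that maps its input onto itself through a narrow hidden layer can be sketched as follows. This is a small fully connected auto-associator written for illustration only: the 8-to-3-to-8 sizes, the learning rate, the four training vectors, and the class name `AutoAssociator` are assumptions, and the 64-to-8 receptive-field module of the paper is not reproduced.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class AutoAssociator:
    """Three-layer BEP network trained to reproduce its own input."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # One weight row per unit; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_output = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                         for _ in range(n_inputs)]

    def forward(self, pattern):
        hidden = [sigmoid(sum(w * x for w, x in zip(row, pattern + [1.0])))
                  for row in self.w_hidden]
        output = [sigmoid(sum(w * h for w, h in zip(row, hidden + [1.0])))
                  for row in self.w_output]
        return hidden, output

    def train_pattern(self, pattern, rate):
        hidden, output = self.forward(pattern)
        # No external teacher: the target vector is the input pattern itself.
        delta_out = [(t - o) * o * (1.0 - o) for t, o in zip(pattern, output)]
        delta_hid = [h * (1.0 - h)
                     * sum(d * self.w_output[k][j] for k, d in enumerate(delta_out))
                     for j, h in enumerate(hidden)]
        for k, d in enumerate(delta_out):
            for j, h in enumerate(hidden + [1.0]):
                self.w_output[k][j] += rate * d * h
        for j, d in enumerate(delta_hid):
            for i, x in enumerate(pattern + [1.0]):
                self.w_hidden[j][i] += rate * d * x

    def error(self, patterns):
        """Mean over patterns of the summed squared reconstruction error."""
        return sum((t - o) ** 2
                   for p in patterns
                   for t, o in zip(p, self.forward(p)[1])) / len(patterns)


# Hypothetical 8-dimensional input vectors standing in for partial spectra.
patterns = [
    [0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9],
]
net = AutoAssociator(n_inputs=8, n_hidden=3)
before = net.error(patterns)
for _ in range(500):
    for p in patterns:
        net.train_pattern(p, rate=0.5)
after = net.error(patterns)
features = [net.forward(p)[0] for p in patterns]  # hidden activations per pattern
```

The hidden activations in `features` play the role of the extracted feature vectors that would be handed to a subsequent classification module.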

4.1  Constraints  Imposition 

As mentioned in Section 2, the patterns in the training set are
constrained to have the same (i) geometry, (ii) size, and (iii)
total force applied to the sensor. The intent is to make the
feature extraction module determine that relative force gradients
are the key features to be learned, by repeated presentation of the
training set. However, the features it does (or does not) capture
can be deduced by testing the network in its “trained” state on new
data. For example, if the training set itself were similar to that
in Figure 2, but the “footprint” of the rod differed in shape from
that of the bar, and the cylinder was of a different size than the
sphere (all with proper pressure distributions), the network would
err and exhibit marked sensitivity to changes in geometry and size
of these objects*. A less conspicuous feature is the aggregate
force applied to the sensor. If the shape and size are constrained
but the total force applied (reflected by the sum of grey-scale
values across the sensor) is different, the network will use the
ensemble force as the discriminating factor among those patterns.
Hence, the network misclassifies when the same gradient profile
with different total force is shown to it. Note that although the
extracted features are found by the network and not
“human-engineered”, the construction of the training set from which
the module learns must be carefully crafted. We conjecture,
however, that if large numbers of examples with varying geometry,
size and forces are shown to the network for each of the surface
profiles, and the network is allowed to evolve over a ‘long’ period
of time, it will discover that relative force gradients are the
common features. However, for initial focused studies, system
parameters must be carefully controlled.

4.2  Connection  Patterns 

Even if proper constraints are imposed on the training set and the
network is trained to extract the salient force gradients, the
feature information may not be accessible for training subsequent
modules. That is, we want the abstracted features to be manifested
as activation patterns across the internal hidden layer such that
these patterns can be used as the training vectors for the
classification module. During self-organization, we compute the
average differences (absolute or squared) over all pixels in a
given image and compute one metric per training pattern. This value
is then compared with a permissible error (PErr) term that
determines when to terminate training. Using the conventional
fully-connected, feed-forward BEP network architecture, repeated
runs consistently resulted in a failure to encode salient features
as activation levels on the hidden units. The networks do converge
below PErr, but all the feature information is stored among the
pattern of link weights rather than as activation levels of hidden
units. A typical convergence pattern is shown below in Table 1.


Table  1:  Typical  Fully-connected  Convergence 


Hidden Activations (all 9 patterns)

.07   .06   .00   .07   .02   .00   .07   .05
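The stopping rule described before Table 1, one error metric per training pattern compared against PErr, can be sketched as below. The function names, the example vectors, and the value 0.05 for the permissible error are hypothetical, and the rule that every pattern must fall below PErr is one reading of the text.

```python
def mean_absolute_difference(target, output):
    """Average absolute difference over all pixels of one pattern."""
    return sum(abs(t - o) for t, o in zip(target, output)) / len(target)


def mean_squared_difference(target, output):
    """Average squared difference over all pixels of one pattern."""
    return sum((t - o) ** 2 for t, o in zip(target, output)) / len(target)


def has_converged(targets, outputs, permissible_error,
                  metric=mean_squared_difference):
    """True when every training pattern's metric is below the permissible error."""
    return all(metric(t, o) < permissible_error
               for t, o in zip(targets, outputs))


# Hypothetical input patterns and their reproductions at the output layer.
targets = [[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]]
outputs = [[0.1, 0.5, 0.9], [0.9, 0.4, 0.1]]

done_squared = has_converged(targets, outputs, permissible_error=0.05)
done_absolute = has_converged(targets, outputs, permissible_error=0.05,
                              metric=mean_absolute_difference)
```

The two metrics can disagree for the same PErr, so the choice of metric and threshold has to be made together. Neither says anything about whether the hidden activations are informative, which is the problem Table 1 illustrates.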

The activation pattern across all 8 units is the same regardless of
the input pattern; thus it is impossible to discern input patterns
from this data. The mean-absolute and mean-squared differences for
all patterns, however, indicate that the reproduction at the output
layer was quite faithful to the input pattern. Nevertheless, this
result is unacceptable since training vectors for the next module
are unavailable. We thus employ the concept of receptive field
neurons. With partial connections from input to hidden units with a
neighborhood radius of k units and center-to-center distance of l
units, the typical convergent hidden layer activations are as shown
below in Table 2.

*Note that for robust performance, a comprehensive training set
should include corresponding sets of each object in various shapes
and sizes, but this initial training “subset” suffices for
preliminary studies.


Table  2:  Convergence  with  Receptive  Field  Units 


Pattern   Hidden Activations (units 1 to 8)

B         .00    .03    .02    .93    .01    .04    .50    .11
R         .01    .02    .02    .96    .01    .08    .38    .04
W         .06    .23    .02    .96    .01    .07    .54    .07
L         .05    .07    .049   .65    .03    .23    .62    .30
U         .04    .09    .05    .63    .11    .25    .63    .14
H         .11    .05    .18    .50    .07    .62    .60    .45
S         .01    .23    .13    .50    .12    .42    .65    .32
C         .02    .05    .021   .41    .09    .52    .74    .07
I         .19    .17    .05    .66    .09    .24    .63    .14

The features among the patterns are not distinctive in the sense
that one is “high” while others are “low”, but results indicate
that, based on even a small threshold, some combination of the 8
feature units can distinguish one surface profile from any other
one. In the constrained tactile domain, the networks do not have a
wide dynamic range to operate near boolean limits; the differences
between the data are subtle, and accordingly, feature units
converge at various analog values in their activation range.
Preliminary results indicate that the dimensionality of the feature
space n is approximately 0.7 < n < p where p is the number of
pattern classes. In the fully-connected configuration, each hidden
unit receives input from the entire input pattern of N units and is
burdened with extraneous/redundant data. With partial connectivity,
not only is the amount of data reduced to some proportion of N, but
more significantly, each unit receives a private segment of each
image. The ratio l/k of center-to-center distance to neighborhood
radius (for k < l < 2k) controls the amount of overlap between the
receptive fields, but simulations show that it does not have an
appreciable effect on feature extraction properties. The important
factor is the segmentation of the input data by receptive fields to
enable the network to discover small distinctions among the
training patterns.
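The role of the radius k and the center-to-center distance l can be made concrete with a sketch of the connection pattern. It simplifies the input to one dimension and picks k = 4 and l = 6 arbitrarily within k < l < 2k; the layout rule (first center at index k, fields that would run past the last input are dropped) is an assumption for illustration.

```python
def receptive_fields(n_inputs, radius, spacing):
    """Input indices seen by each hidden unit in a 1-D layout.

    Each field covers 2 * radius + 1 inputs and the field centres are
    `spacing` inputs apart.
    """
    fields = []
    centre = radius
    while centre + radius < n_inputs:
        fields.append(list(range(centre - radius, centre + radius + 1)))
        centre += spacing
    return fields


def overlap(fields):
    """Number of inputs shared by the first two adjacent fields."""
    return len(set(fields[0]) & set(fields[1]))


fields = receptive_fields(n_inputs=64, radius=4, spacing=6)
shared = overlap(fields)
```

With these values each hidden unit sees 9 of the 64 inputs and shares 3 of them with its neighbour; moving l towards 2k reduces the overlap and moving it towards k increases it.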

A design question arises as to how many feature (hidden) units are
required and how large the receptive fields should be. In general,
the number of required feature nodes is not known a priori, and it
will depend intimately on the degree of correlation of the input
vectors. To design a feature extractor network without too much
human preprocessing and scrutiny, a network can start with a large
number of feature nodes, each with small receptive fields. After
the network converges, the activation patterns across the feature
units separate crucial from extraneous units. The extraneous units
can then be “pruned” either by human inspection or by some
automatic algorithm as discussed by Sietsma and Dow [Sietsma and
Dow, 1988]. The module described above started with 16 hidden
nodes; it turns out that only 8 were sufficient to distinguish
between the patterns.
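One simple way to separate crucial from extraneous units after convergence is to look at how much each unit's activation varies across the training patterns. The sketch below uses that range heuristic, which is an assumption made for illustration and is not the pruning procedure of Sietsma and Dow; the activation values and the threshold are hypothetical.

```python
def activation_range(activations, unit):
    """Spread of one hidden unit's activation over all training patterns."""
    values = [pattern[unit] for pattern in activations]
    return max(values) - min(values)


def units_to_keep(activations, min_range):
    """Indices of units whose activation actually varies between patterns."""
    n_units = len(activations[0])
    return [unit for unit in range(n_units)
            if activation_range(activations, unit) >= min_range]


# Hypothetical converged activations: rows are patterns, columns are units.
activations = [
    [0.07, 0.93, 0.50, 0.00],
    [0.07, 0.65, 0.62, 0.01],
    [0.06, 0.41, 0.74, 0.00],
]
kept = units_to_keep(activations, min_range=0.1)
```

Units 0 and 3 respond almost identically to every pattern and carry no discriminating information, so only units 1 and 2 survive.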

5  Clustering  Groups  of  Patterns

Thus far, we have discussed two coupled BEP networks that can
extract features and classify the basic data set of Figure 2. Next,
we extend the problem and test whether these ANS can categorize
“families” of such patterns. We relax the previous stipulations of
uniform pressure distribution on the objects and consider some
distorted patterns. In advanced robotic applications where the
robot must operate without precision peripheral devices such as
positioning jigs, consistent acquisition of flawless sensor data is
highly improbable. On a tactile sensor, for instance, an object is
likely to be pressed slightly harder on one section than another.

Hence, the training set is augmented to include additional
impressions of the nine objects as they are “rolled” slightly to
the left, right, top, and bottom of the objects. We simulate up to
20% pressure gradients for four basic directions (right, left, top,
bottom) on the sensor, and obtain a total of 45 training patterns.
The impressions for the bar in various “rolls” are shown in
Figure 5.


Figure  5:  Distorted  ’bars’  to  Cluster 
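The count of 45 patterns (nine objects, each in its original form plus four rolls) can be reproduced with a small augmentation sketch. The linear gain profile, the choice to vary it symmetrically around 1.0, and the uniform placeholder images are all assumptions; the text does not specify how the gradients were simulated.

```python
DIRECTIONS = ("right", "left", "top", "bottom")


def roll(image, direction, max_gradient=0.2):
    """Apply a linear pressure gradient, pressing harder on the named side.

    The gain runs from 1 - max_gradient / 2 on the light side to
    1 + max_gradient / 2 on the hard side. Row 0 is taken to be the top.
    """
    rows, cols = len(image), len(image[0])
    rolled = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            # position is 0.0 on the lightly pressed side, 1.0 on the hard side
            position = {
                "right": c / (cols - 1),
                "left": 1.0 - c / (cols - 1),
                "bottom": r / (rows - 1),
                "top": 1.0 - r / (rows - 1),
            }[direction]
            gain = 1.0 + max_gradient * (position - 0.5)
            new_row.append(image[r][c] * gain)
        rolled.append(new_row)
    return rolled


def augment(primitives):
    """Return (tag, image) pairs: each original plus its four rolled versions."""
    training_set = []
    for tag, image in primitives.items():
        training_set.append((tag, image))
        for direction in DIRECTIONS:
            training_set.append((tag, roll(image, direction)))
    return training_set


# Placeholder 4x4 uniform impressions, one per tag used in the paper.
primitives = {tag: [[1.0] * 4 for _ in range(4)] for tag in "BRWLUHSCI"}
training_set = augment(primitives)
rolled_right = roll(primitives["B"], "right")
```

Every rolled pattern keeps the tag of its source object, so the five versions of each object form one target cluster.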

Despite the distortions, the network should learn that all five
patterns belong to class bar, and similarly, assign the other
pattern clusters to their respective categories. The problem
becomes much more difficult, because the distinctions between the
surfaces become less clear, and one pattern can resemble another
very closely. For instance, the impression of the bar rolled to the
right (Figure 5a) is now very similar to the original slant (Figure
2d) pattern. Only subtle differences remain between the patterns,
such as the slightly activated neighboring column of tacels to the
right edge in Figure 5a, resulting from increased deformation of
the sensor surface there. Likewise, when the slant is rolled to the
left, it resembles the original bar (Figure 2a), and the same holds
for the cylinder and cir_slant objects. Note that the corresponding
Fourier spectra that serve as the input vectors also become more
and more similar, and pose a difficult learning problem for the
ANS.

This problem requires the ANS to cluster the five patterns as a
group and derive cluster centers that are as distant as possible to
distinguish between the clusters. The clustering process can be
viewed as unsupervised classification in which N samples (each
characterized by an n-dimensional vector) are classified into M
classes ($\omega_1, \ldots, \omega_M$). The classification $\Omega$
is a vector made up of the $\omega_{k_i}$'s and the configuration
$X^*$ is a vector made up of the $X_i$'s. The cluster criterion J
then is a function of both $\Omega$ and $X^*$:

$J = J(\Omega; X^*) = J([\omega_{k_1}, \ldots, \omega_{k_N}]^T; [X_1^T, \ldots, X_N^T]^T)$

In seeking data clusters via standard algorithms, some measure of
similarity such as J establishes a rule for assigning patterns to
the domain of a particular cluster center. Moreover, since the
proximity of two patterns is a relative measure of similarity, it
is usually necessary to establish a threshold in order to define
degrees of acceptable similarity in the cluster-seeking process.
Preliminary indications are that the feed-forward BEP network could
prove to be quite effective in forming both the similarity measure
and the threshold via some combination of its link weights and bias
parameters.
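The interplay of a similarity measure and a threshold in conventional cluster seeking can be illustrated with a simple threshold-based scheme. This is a generic textbook-style procedure chosen for illustration, not the mechanism by which the BEP network forms its clusters; Euclidean distance, the threshold of 0.5, and the sample points are assumptions.

```python
import math


def cluster(samples, threshold):
    """Threshold-based cluster seeking.

    A sample joins the nearest existing cluster centre if it lies within
    `threshold` of it; otherwise it founds a new cluster.
    """
    centres = []
    labels = []
    for sample in samples:
        distances = [math.dist(sample, centre) for centre in centres]
        if distances and min(distances) <= threshold:
            labels.append(distances.index(min(distances)))
        else:
            centres.append(sample)
            labels.append(len(centres) - 1)
    return centres, labels


# Hypothetical 2-D feature vectors forming two loose groups.
samples = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 1.1), (0.05, 0.0)]
centres, labels = cluster(samples, threshold=0.5)
```

The result depends strongly on the threshold and on the presentation order, which is part of why a learned similarity measure is attractive for closely spaced patterns.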

Simulation results indicate that for the clustering problem, more
hidden units are significant and contribute to the feature
extraction process. The activation levels of the receptive field
hidden units contain the feature information. However, the
correlation of the input patterns is high and there exist many
overlaps of features among the 45 patterns. Even after the feature
extractor converges with low PErr, it is difficult to discern some
of the boundaries due to the high dimensionality of the feature
space. As one may suspect, clear boundaries can be found among the
extrema of many hidden nodes that separate the circular and
rectangular silhouettes, but as few as one hidden node separates
some patterns of the same geometry, such as (bar, rod), (rod,
wedge), and (cylinder, cir_slant). The collection of all the
features, distinctive and subtle, serves as training vectors for
the classification module.
6  Pattern  Classification

The classification module is a very standard, fully-connected
two-layer BEP module and its details are omitted here. The values
for its input units are propagated from the activation levels of
the feature nodes of the preceding module. The output units
classify them into nine categories, this time supervised by an
explicit teach vector. For the basic pattern set, using the
patterns shown in Table 2 above, the classifier learns to separate
those features, and achieves clean and distinct class separations
at convergence. Similar results are obtained for classifying all 45
patterns into nine categories, although convergence takes five to
seven times as many iterations.

7  Summary 

This paper presented an application of ANS for feature extraction,
clustering, and classification of tactile impressions in the
robotics domain, and revealed some basic characteristics of the BEP
network's response to geometry-, size-, and activation-constrained
grey-scale data. Fundamental results indicate that with appropriate
network architecture and training sets, ANS are able to extract
subtle feature differences and construct high-dimension decision
surfaces. These results, however, were derived from a very small
training sample, and further analysis must be completed before
general conclusions can be reached. The response of ANS to noisy
data, and their effectiveness in classifying untrained patterns,
are being examined. Also, scaling studies should indicate if these
concepts can be extended to (at least low-resolution) vision data.

References 

[Cottrell  ei  ai,  1987]  G  Cottrell,  P  Munro,  and 
D  zipser.  Image  compression  by  back  propagation; 
An  example  of  extensional  programming.  ICS  Re¬ 
port  8702,  Feb  1987. 

[Goddard  ei  ai,  1989]  N  Goddard,  K  Lynne,  T  Mintz, 
and  L  Bukys.  Rochester  connectionist  simulator. 
Technical  Report  233,  University  of  Rochester,  Oct 
1989. 

[Hering  et  ai,  1990]  D  Hering,  P  Khosia,  and  B  Ku¬ 
mar.  The  use  of  modular  neural  networks  in  tactile 
sensing.  International  Joint  Conference  on  Neural 
Networks,  2:355-358,  Jan  1990. 

[Ilillis,  1982]  D  Hillis.  A  high  resolution  imaging  touch 
sensor,  international  Journal  of  Robotic  Research, 
l(2):33-44,  1982. 

[Kadonoff,  1983]  M  Kadonoff.  3-d  object  recognition 
and  orientation  with  gray-scale  data  using  ternary 
tree  structure  and  countour  centers.  Master’s  thesis, 
Duke  University,  Dec  1983. 


[Kinoshita,  1973]  G  Kinoshita.  Pattern  recognition  of 
the  grasped  object  by  the  artificial  hand.  Proceed¬ 
ings  of  the  3rd  International  Joint  Conference  on 
AI,  pages  665-669,  1973. 

[Kuczewski  et  ai,  1987]  R  Kuczewski,  M  Myers,  and 
W  Crawford.  Exploration  of  backward  error  propa¬ 
gation  as  a  self-organizational  structure.  IEEE  In¬ 
ternational  on  Neural  Networks,  2:89-95,  1987. 

[Marik,  1981]  V  Marik.  Algorithms  of  the  complex 
tactile  information  processing.  Proceedings  of  Inter¬ 
national  Joint  Conference  on  Artificial  Intelligence, 
pages  773-774,  Aug.  1981. 

[Muthurkishnan  et  ai,  1987]  C  Muthurkishnan, 
D  Smith,  D  Myers,  and  J  Rebman.  Edge  detection 
in  tactile  Images.  Proceedings  of  IEEE  International 
Conference  on  Robotics  and  Automation,  1987. 

[Nicholls and Lee, 1989] H Nicholls and M Lee. A survey
of robot tactile sensing technology. The International
Journal of Robotics Research, 8(3):3-30, June
1989.

[Rebman  and  Trull,  1983]  J  Rebman  and  M  Trull.  A 
robust  tactile  sensor  for  robot  applications.  Techni¬ 
cal  Report  LL-2142,  Lord  Corporation,  1983. 

[Rumelhart et al., 1986] D Rumelhart, J McClelland,
and PDP Research Group. Parallel Distributed Processing,
volume 1. MIT Press, 1986.

[Sato et al., 1977] N Sato, W Heginbotham, and
A Pugh. A method for three dimensional part identification
by tactile transducer. Proceedings of the
7th International Symposium on Industrial Robots,
pages 577-585, 1977.

[Saund,  1986]  E  Saund.  Abstraction  and  representa¬ 
tion  of  continuous  variables  in  connectionist  net¬ 
works.  Proceedings  of  the  Fifth  National  Conference 
on  Artificial  Intelligence,  pages  638-644,  1986. 

[Sietsma and Dow, 1988] J Sietsma and R Dow. Neural
net pruning - why and how. IEEE International
Conference on Neural Networks, 1:325-333, 1988.

[Takeda,  1974]  S  Takeda.  Study  of  artificial  tactile 
sensors  for  shape  recognition  -  algorithm  for  tac¬ 
tile  data  input  -.  Proceedings  of  the  International 
Symposium  on  Industrial  Robots,  4:199-208,  1974. 

[Togai,  1982]  M  Togai.  Pattern  recognition  schemes 
for  a  touch-sensor  array  based  on  moment  invariant 
and  circle  search  methods.  Technical  Report  EE-82- 
03,  Duke  University,  Aug  1982. 


EXPLANATION-BASED  LEARNING 




Generalizing the Order of Goals as an Approach to Generalizing Number

Henrik  Bostrom 

SYSLAB 

Department  of  Computer  and  Systems  Sciences 
Stockholm  University 
Electrum  230, 164  40  Kista 
Sweden 


Abstract 

The  common  idea  among  previous 
approaches  for  generalizing  number  is  to 
look  for  repeated  applications  of  rules 
(or  operators)  while  generalizing  an 
example  proof  to  produce  a  general 
schema.  We  describe  another  approach, 
the  algorithm  N,  that  generalizes 
number  from  planning  problems  which 
are  stated  in  the  STRIPS-formalism. 
Instead  of  generalizing  the  structure  of 
an  example  proof  in  terms  of  operator 
applications,  the  algorithm  generalizes 
the  order  in  which  the  literals  in  the 
goal  state  description  are  reached,  yiel¬ 
ding  a  generalized  precedence  graph. 
The  algorithm  N  is  compared  to  one  of 
the  previous  approaches  for  generaliz¬ 
ing  number,  that  can  be  applied  to  plan¬ 
ning  problems  which  are  stated  in  the 
STRIPS-formalism.  Experiments  have 
shown  that  schemata  produced  by  algo¬ 
rithm  N  can  be  more  efficiently  utilized 
than  schemata  produced  by  the  previous 
algorithm.  Also,  the  algorithm  N  is 
shown  to  be  able  to  handle  a  class  of 
problems  that  the  previous  algorithm 
cannot. 


1  Introduction 

To  generalize  the  structure  of  an  example  proof 
such  that  a  fixed  number  of  rule  applications  in 
the  proof  is  generalized  into  an  unbounded 
number  of  applications  is  commonly  referred  to 
as  generalizing  number.  Another  name  for  the 
same  principle  is  generalizing  to  N.  In  this 
work,  we  use  the  term  generalizing  number  in  a 
broader  sense  than  restricting  the  generaliza¬ 
tions  to  be  made  in  terms  of  rule  or  operator 
applications.  By  generalizing  number,  we  mean 
the  principle  of  generalizing  the  solution  of  a 
specific example problem into a general schema
that can be used to solve a class of problems
whose members may differ from each other not only
in the names of the objects involved but also in the
number of objects involved. For example, an instance
of generalizing number in the Blocks World is to learn
a general schema for building towers of arbitrary
height from a proof of building a tower four blocks
high. Different approaches to generalizing number
have been presented in [Shavlik & Dejong 87],
[Cohen et al 88], [Cohen 88], [Shavlik 88] and
[Shavlik 89].

The common idea among these approaches,
while  generalizing  an  example  proof  to  produce 
a  general  schema,  is  to  look  for  repeated 
applications  of  rules  (or  operators).  When  such 
a  repetition  is  found,  a  loop  construct  is  added  to 
the  corresponding  general  schema.  We  have 
focused  on  generalizing  number  in  systems  that 
solve  planning  problems  based  on  the  STRIPS- 
formalism [Nilsson 82]. The previous
approaches to generalizing number have only
been concerned with generalizations of proofs
made by a Horn clause theorem prover.
However, one of the previous algorithms [Cohen
88] can also be used to generalize proofs in the
STRIPS-formalism. This method has one major
shortcoming: the general schemata learned by the
method contain information only about which
operators to select, not about how to apply them
(knowledge about selecting the appropriate
match substitution, in the terminology of
[Nilsson 82]). In the tower building example,
a system using a general schema learned by the
method to build a certain tower knows which
operator to use in a particular situation (stack,
pickup, etc.), but does not know which blocks to
involve in the action. This observation has
recently been presented in [Bostrom 89]. In many
domains  the  search  space  may  not  be  pruned 
enough  by  the  general  schemata  learned,  to 
enable  new  problems  of  the  same  class  to  be 
solved  within  acceptable  time. 

In  this  paper  we  present  the  algorithm  N,  a 
new  approach  to  generalizing  number  in  plan¬ 
ning  problems  based  on  the  STRIPS-formalism. 
Instead of generalizing the structure of an example
proof in terms of operator applications, the
algorithm  N  generalizes  the  order  in  which  the 
literals  in  the  goal  state  description  are 
reached.  The  algorithm  N  is  described  in  the 
next  section.  In  section  3  we  present  an  algorithm 
for  interpreting  schemata  produced  by  algo¬ 
rithm  N,  when  solving  new  problems.  A  short 
presentation  of  how  Cohen's  algorithm  can  be 
applied  to  planning  problems  in  the  STRIPS- 
formalism  is  given  in  section  4.  Some  prelimin¬ 
ary  results  comparing  the  algorithm  N  to  the 
algorithm  of  Cohen  are  presented  in  section  5. 

2  The  Algorithm  N 

We  first  define  the  input  to  algorithm  N,  a 
precedence  graph,  that  represents  all  solutions 
(with  some  restrictions)  to  an  example  problem. 
Then  we  describe  the  algorithm  N  itself,  which 
produces  a  generalized  precedence  graph. 
Finally,  we  give  some  examples  of  the  limita¬ 
tions  of  the  general  heuristic  used  in  algorithm 
N. 

2.1  Goal  orders  and  precedence  graphs 

A  goal  order  for  a  problem,  is  a  sequence  of  the 
literals  in  the  goal  state  description,  such  that 
the  literals  can  be  added  sequentially  by  apply¬ 
ing  STRIPS-operators  without  ever  deleting  a 
previously  added  literal  in  the  goal  state 
description.  A  precedence  graph  represents  all 
possible  goal  orders  for  a  certain  problem. 
Below,  we  give  definitions  of  these  two 
concepts.  Examples  of  precedence  graphs  for 
building  a  tower  and  an  arch  are  also  presented. 

Definition: Given an initial state description Id,
a goal state description Gd and a set of STRIPS-
operators S, a goal order is a sequence (l_1, ..., l_n)
containing all literals in Gd exactly once, such
that there exists a solution sequence¹ (s_1, ..., s_m)
with the following properties:

i) for each literal l_i in (l_1, ..., l_n) there does not
exist more than one instance s_j in (s_1, ..., s_m)
that contains l_i in its add list (each literal must
not be added more than once)

ii) for each pair of literals (l_i, l_j) in (l_1, ..., l_n)
such that i<j, and for each pair of instances (s_k, s_l)
in (s_1, ..., s_m), if s_k contains l_i in its add list and
s_l contains l_j in its add list, then k<l (each literal
preceding another literal in the sequence
must be added before the other).

¹A solution sequence is a sequence of instances of
STRIPS-operators, such that when applied to the
initial state description, a state description is
produced from which the goal state description
logically follows.
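The two properties can be checked mechanically for a candidate sequence. The following Python sketch is only an illustration: the name `is_goal_order` and the representation of an operator instance as a (name, add list) pair are assumptions made here, and the check presupposes that the given sequence is already known to be a solution sequence.

```python
def is_goal_order(order, solution):
    """Check properties i) and ii) for a candidate goal order.

    order    -- list of goal literals (strings), each occurring once
    solution -- list of (operator_name, add_list) pairs, assumed to be
                a valid solution sequence (this is not verified here)
    """
    if len(set(order)) != len(order):
        return False
    steps = []
    for literal in order:
        adders = [k for k, (_, adds) in enumerate(solution) if literal in adds]
        if len(adders) > 1:      # property i): added at most once
            return False
        if adders:               # literals already true are never added
            steps.append(adders[0])
    # property ii): earlier literals are added by earlier instances
    return all(k < l for k, l in zip(steps, steps[1:]))


# A solution sequence for a four-block tower, with add lists written
# in the usual Blocks World style (an assumption of this sketch).
solution = [
    ("pickup(c)", {"holding(c)"}),
    ("stack(c,d)", {"on(c,d)", "clear(c)", "handempty"}),
    ("pickup(b)", {"holding(b)"}),
    ("stack(b,c)", {"on(b,c)", "clear(b)", "handempty"}),
    ("pickup(a)", {"holding(a)"}),
    ("stack(a,b)", {"on(a,b)", "clear(a)", "handempty"}),
]
print(is_goal_order(["on(c,d)", "on(b,c)", "on(a,b)"], solution))  # True
print(is_goal_order(["on(a,b)", "on(b,c)", "on(c,d)"], solution))  # False
```

Because "added earlier" is a strict order, comparing neighbouring literals in the sequence is enough to cover every pair required by property ii).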

Definition: Given an initial state description Id,
a goal state description Gd and a set of STRIPS-
operators S, a precedence graph G is a directed
acyclic graph, where each node is labeled with
a literal, and that has the following properties:

i) every literal in Gd appears exactly once in
the graph, and there does not exist a literal in G
that is not a member of Gd (all literals in the goal
state description, and only those, must be
mentioned in the graph)

ii) there exists a subset g of the nodes in G such
that each node in the graph that is not a member
of g can be reached by following the arcs from a
node in g, and there exists a node in the graph
that can be reached from each node in g (all
nodes in the graph must be connected)

iii) there does not exist a goal order (l_1, ..., l_n)
with respect to Id, Gd and S, such that there
exists a pair of literals (l_i, l_j) in the sequence
where i<j and l_i can be reached from l_j in G by
following the arcs (a literal that precedes
another literal in a goal order must not be
reachable from the other literal in the graph)

iv) there do not exist three different literals a, b
and c in G such that b and c are successors of a,
and c can be reached from b by following the
arcs, and there does not exist a pair of literals
(a, b) such that there is more than one arc from
a to b (redundant arcs are not allowed).

In  an  example  proof  one  particular  goal  order  is 
easily  found,  by  regarding  the  order  in  which 
the  literals  in  the  goal  description  are  added. 
But  in  order  to  find  the  corresponding  precedence 
graph  to  the  example  problem  we  need  all  goal 
orders  for  the  particular  problem,  i.e.  all  possi¬ 
ble  ways  of  solving  the  problem  without  delet¬ 
ing  a  previously  added  literal  in  the  goal  state 
description.  An  algorithm  for  deriving  a  prece¬ 
dence  graph,  given  all  goal  orders  for  a  problem 
is  presented  in  [Bostrom  90].  Henceforth,  we 
assume  that  the  corresponding  precedence  graph 
is  given  for  each  problem,  instead  of  all  goal 
orders. 
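One way to read a precedence graph is as a compact encoding of orderings: every sequence of the goal literals that respects all arcs is a candidate goal order. The sketch below enumerates those sequences for small graphs. The function name `consistent_orders` is an assumption of this sketch, and the three-literal arch graph is a hypothetical stand-in chosen for illustration, not a graph taken from the paper.

```python
def consistent_orders(nodes, arcs):
    """Yield every ordering of `nodes` that respects all arcs (a, b),
    where an arc means that a must come before b."""
    preds = {n: set() for n in nodes}
    for a, b in arcs:
        preds[b].add(a)

    def extend(prefix, remaining):
        if not remaining:
            yield tuple(prefix)
            return
        for n in sorted(remaining):
            # n may be placed once all its direct predecessors are placed
            if preds[n] <= set(prefix):
                yield from extend(prefix + [n], remaining - {n})

    yield from extend([], frozenset(nodes))


# A linear graph for a four-block tower: exactly one ordering.
tower = list(consistent_orders(
    ["on(a,b)", "on(b,c)", "on(c,d)"],
    [("on(c,d)", "on(b,c)"), ("on(b,c)", "on(a,b)")]))
print(tower)

# Hypothetical non-linear graph: two pillars before the top block.
arch = list(consistent_orders(
    ["on(a,b)", "on(c,d)", "on2(e,a,c)"],
    [("on(a,b)", "on2(e,a,c)"), ("on(c,d)", "on2(e,a,c)")]))
print(len(arch))
```

A graph with a single consistent ordering corresponds to the linear case; as soon as two literals are unordered with respect to each other, more than one ordering appears.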

Example  1 

Consider the initial state description Id =
{handempty, ontable(a), clear(a), ontable(b),
clear(b), ontable(c), clear(c), ontable(d),
clear(d)}, the goal state description of a tower
Gd = {on(a, b), on(b, c), on(c, d)} and the set of
STRIPS-operators for the Blocks World [Nilsson 82,
p281]. The precedence graph (and the corresponding
tower) is shown in figure 1. The precedence
graph is linear² since there exists exactly
one goal order, namely (on(c, d), on(b, c), on(a,
b)).


on(c,d) -> on(b,c) -> on(a,b)

Figure 1. A precedence graph for building a
tower.

Example  2 

Assume that we extend the set of STRIPS-operators
in example 1 with a stack2(X, Y, Z) operator
that can be used to stack a block on two
blocks. Consider the initial state description
where all blocks are on the table, the goal state
description that corresponds to the arch in
figure 2 and the extended set of STRIPS-operators.
A precedence graph for building the arch is
shown in figure 2 (below the arch). The precedence
graph in figure 2 is not linear since there
exists more than one goal order for the problem.


Figure  2.  A  precedence  graph  for  an  arch. 


2.2  Generalizing  Precedence  Graphs 

The  principle  of  generalizing  number  was  defi¬ 
ned  as  the  principle  of  generalizing  the  number 
of  objects  involved  in  a  specific  example  in  order 
to  produce  a  general  schema.  Generalizing 
number  from  examples  of  building  a  tower  or  an 
arch  means  producing  general  schemata  that  can 
be used to solve problems of building towers and
arches involving an arbitrary number of objects.
The commonly adopted heuristic in algorithms
for generalizing number is to look for some kind
of repetition in a proof of an example problem.
In contrast to previous approaches for generalizing
number, we do not look for repetition of
operator applications in a proof. Instead, we
look for repetition in the precedence graph
corresponding to the example problem. The
question is: repetition of what?

The  purpose  of  generalizing  a  precedence 
graph  is  to  be  able,  when  solving  a  problem,  to 
give  a  partial  ordering  of  the  literals  in  the 
goal  state  description.  This  ordering  can  then  be 
used  to  produce  a  solution  sequence.  A  precedence 
graph  can  be  viewed  as  an  instance  of  a  schema 
that  declares  order  constraints  on  the  literals  in 
the  goal  description.  We  want  our  algorithm  to 
determine  these  order  constraints  by  regarding 
the  precedence  graph  that  corresponds  to  the 
example problem. Then the heuristic of looking
for  repetition  of  order  constraints  in  the  prece¬ 
dence  graph  can  be  used  in  order  to  generalize 
number. 

We  assume  that  the  order  constraints  in  a 
precedence  graph  can  be  expressed  as  relations 
between  literals  that  are  directly  connected  by 
arcs.  The  algorithm  must  determine  for  each 
pair of literals (a, b) in the precedence graph,
such  that  b  is  a  successor  of  a,  how  they  are  rela¬ 
ted.  The  following  heuristic  is  used  in  algorithm 
N:  "two  literals  can  be  said  to  be  related  if  they 
share  at  least  one  object  constant  or  if  they  have 
different  object  constants  that  appear  as  argu¬ 
ments  of  the  same  literal  in  the  initial  state 
description".  Some  limitations  of  this  heuristic 
are  presented  in  section  2.3. 
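The heuristic can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's implementation: literals are assumed to be tuples such as ("on", "c", "d"), the function name `relations` is hypothetical, and the encoding of a relation as a tuple of argument positions is a choice made here for brevity.

```python
def relations(lit1, lit2, initial_state):
    """Return the relations found between two literals.

    ("eq", i, j)         -- argument i of lit1 equals argument j of lit2
    (pred, i, p, j, q)   -- argument i of lit1 and argument j of lit2 are
                            different constants that occur at positions
                            p and q of a `pred` literal in the initial state
    """
    args1, args2 = lit1[1:], lit2[1:]
    found = set()
    for i, a in enumerate(args1):
        for j, b in enumerate(args2):
            if a == b:
                found.add(("eq", i, j))
                continue
            for literal in initial_state:
                pred, args = literal[0], literal[1:]
                if a in args and b in args:
                    found.add((pred, i, args.index(a), j, args.index(b)))
    return found


# Initial state with four blocks on the table, all clear.
initial_state = [("handempty",)]
for block in "abcd":
    initial_state += [("ontable", block), ("clear", block)]

print(relations(("on", "c", "d"), ("on", "b", "c"), initial_state))
# {('eq', 0, 1)}
```

With this initial state no literal has more than one argument, so only the shared-constant case can fire; the single result says that the first argument of the first literal equals the second argument of the second one.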

The  generalization  of  a  precedence  graph  is 
made  by  the  algorithm  N  in  two  steps.  First,  a 
general  version  of  the  precedence  graph  is 
produced  where  the  arcs  represent  relations 
between  the  literals  in  the  precedence  graph. 
Then  the  new  graph  is  reduced,  by  merging  nodes 
having  arcs  with  common  relations,  yielding 
the  final  generalized  precedence  graph.  Below, 
we  give  an  overview  of  the  algorithm  N.  A 
detailed  description  of  the  algorithm  is  found  in 
[Bostrom  90]. 


²By a linear precedence graph, we mean a precedence
graph  where  each  node  does  not  have  more  than  one 
successor  and  each  node  does  not  have  more  than  one 
predecessor. 




Algorithm  N 

Input: a precedence graph G and an initial state
description Id

Output: a generalized precedence graph G'

1. Let G' be a graph consisting of the nodes in G.
For each pair of nodes (n1, n2), such that n2 is
a successor of n1 in G, add an arc from n1 to n2
to G'.

Label all arcs with triples, where the first
and second argument are the literals corresponding
to n1 and n2, with each constant
replaced with a unique variable. The third
argument is a set of relations describing the
relation between the two literals, according
to the heuristic that literals can be related
by sharing constants, or by having different
constants as arguments that appear in the
same literal in the initial state description.

2. Merge all pairs of nodes (n1, n2) where n2 can
be reached from n1, and where both nodes
have arcs leading from them labeled with
the same triples.
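Step 2 can be sketched as follows, under simplifying assumptions: the arcs are taken to be labeled already (labels are plain strings such as "R1"), merging is decided in a single pass over the original graph, and the function name `merge_nodes` and the returned data layout are hypothetical. The detailed algorithm may differ from this sketch.

```python
def merge_nodes(nodes, arcs):
    """Merge nodes as in step 2.

    nodes -- list of node names
    arcs  -- dict mapping (source, target) to the label of that arc
    Returns (representative, merged_arcs): a dict mapping every node to
    the node it was merged into, and a set of (source, target, label).
    """
    def out_labels(n):
        return frozenset(lab for (s, _), lab in arcs.items() if s == n)

    def reachable(a, b):
        stack, seen = [a], set()
        while stack:
            cur = stack.pop()
            for s, d in arcs:
                if s == cur and d not in seen:
                    if d == b:
                        return True
                    seen.add(d)
                    stack.append(d)
        return False

    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    for a in nodes:
        for b in nodes:
            same = out_labels(a) and out_labels(a) == out_labels(b)
            if a != b and same and reachable(a, b):
                parent[find(b)] = find(a)

    merged = {(find(s), find(d), lab) for (s, d), lab in arcs.items()}
    return {n: find(n) for n in nodes}, merged


nodes = ["on(c,d)", "on(b,c)", "on(a,b)"]
arcs = {("on(c,d)", "on(b,c)"): "R1", ("on(b,c)", "on(a,b)"): "R1"}
representative, merged = merge_nodes(nodes, arcs)
print(representative)
print(sorted(merged))
```

For the tower graph the first two nodes collapse into one node that carries an R1 arc to itself and an R1 arc to the node without successors, which is the kind of loop that lets the schema cover towers of any height.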

Example  1  cont. 

Application  of  step  1  in  the  algorithm  to  the 
precedence  graph  in  figure  1  results  in  the  graph 
shown  in  figure  3a.  The  only  relation  found 
between  the  literals  'on(c,  d)'  and  'on(b,  c)'  in 
the  precedence  graph,  is  that  equality  holds 
between  the  first  literal's  first  argument  and 
the  second  literal's  second  argument.  There 
exists  no  literal  in  the  initial  state  description 
that  has  more  than  one  argument,  and  thus 
there  is  no  literal  that  contains  two  different 
constants  from  the  literals.  The  same  relation 
between  the  literals  'on(b,  c)'  and  'on(a,  b)'  in 
the  precedence  graph  is  found  by  step  1  in  the 
algorithm. 

After  step  2,  the  graph  in  figure  3a  is  genera¬ 
lized  into  the  generalized  precedence  graph  in 
figure 3b. Two nodes have been merged, since one
of  them  could  be  reached  from  the  other,  and 
from  both  nodes  arcs  were  leading  labeled  with 
the  same  triples. 


R1 = (on(X,Y), on(Z,W), {X=W})

Figure  3b.  A  generalized  precedence  graph  for 
building  towers. 

Example  2  cont. 

Application of step 1 in the algorithm N to the
precedence graph in figure 2 results in the graph
shown in figure 4a. As we can see, the same relation
has been found as in the previous example
(denoted R1), but also two other relations have
been found.

In  step  2,  two  pairs  of  nodes  are  merged,  yield¬ 
ing  the  generalized  precedence  graph  in  figure 
4b.  No  more  nodes  in  the  resulting  generalized 
precedence  graph  allow  merging,  since  there 
does  not  exist  a  pair  of  nodes  from  which  iden¬ 
tically  labeled  arcs  lead,  and  where  one  node 
can  be  reached  from  the  other. 


R1 = (on(X,Y), on(Z,W), {X=W})

R2 = (on(X,Y), on2(Z,V,W), {X=V})

R3 = (on(X,Y), on2(Z,V,W), {X=W})

Figure  4a.  Graph  produced  by  step  1  in  algo¬ 
rithm  N  for  the  arch  building  example. 


Figure 4b. A generalized precedence graph for
building arches.


on(c,d) -R1-> on(b,c) -R1-> on(a,b)

R1 = (on(X,Y), on(Z,W), {X=W})


Figure  3a.  Graph  produced  by  step  1  in  algo¬ 
rithm  N  for  the  tower  building  example. 


2.3 Limitations of the Heuristic in Algorithm N

Three  types  of  incorrect  generalizations  can  be 
noticed  due  to  the  heuristic  used  in  the  algo¬ 
rithm  N: 

i) over-generalization as a consequence of the
algorithm's inability to find a relation between
two literals in the precedence graph (i.e. there
does not exist a literal in the initial state
description that contains object constants from
both of the literals and the literals have no
object constant in common)

ii)  under-generalization  as  a  consequence  of  the 
algorithm's  consideration  of  a  literal  in  the 
initial  state  description  that  is  irrelevant, 
when  looking  for  a  relation  between  two  literals 
in  the  precedence  graph 

iii) under-generalization as a consequence of an
assumption by the algorithm that two literals
in the precedence graph have a common object
constant by necessity, when it is by coincidence.

3 Interpretation of Generalized Precedence Graphs

A  generalized  precedence  graph  is  used  to  give  a 
partial  ordering  of  the  literals  in  the  goal  state 
description,  that  can  be  used  to  produce  a  solut¬ 
ion  sequence.  In  this  section  we  describe  the 
algorithm  Find-solution  that  can  be  used  to  find 
a  solution  sequence  for  a  particular  planning 
problem,  given  a  generalized  precedence  graph. 

The algorithm works in three steps. First, a
set of tree structured graphs is produced, constraining
the order of the literals in the goal
state description. Second, a potential goal order
is derived using the set of trees. Third, a domain
independent means-ends analysis system is used
to produce a solution sequence, given the potential
goal order.

The algorithm can be summarized as follows.
For  algorithmic  details,  see  [Bostrom  90]. 

Algorithm  Find-solution 
Input: an initial state description Id, a goal state
description Gd, a set of STRIPS-operators O and
a generalized precedence graph G'

Output:  a  solution  sequence  S 

1. A set of tree structured graphs is created in
the  following  way.  First,  literals  in  the  goal 
state  description  are  matched  to  the  second 
argument  of  relations  that  label  arcs  leading 
to  nodes  without  successors.  For  each  literal 
that  matches  such  an  argument,  a  tree  is 
constructed  where  the  literal  is  the  root. 
From  each  root,  there  may  lead  branches  to 
other  literals.  These  literals  are  such  that 
they  match  the  first  argument  of  a  relation 
that  labels  an  arc,  that  leads  to  the  node  of 
the  root  in  the  generalized  precedence  graph. 
In  addition,  all  facts  in  the  third  argument 
must  be  true  with  respect  to  the  initial  state 
description  and  with  respect  to  the 


equivalence  relation.  From  the  literals  that 
match  the  first  argument  of  the  relation, 
trees  are  created  in  a  recursive  manner 
starting  at  the  nodes  from  which  the  arcs 
lead  in  the  generalized  precedence  graph. 

2.  A  potential  goal  order  is  produced,  given  the 
set  of  tree  shaped  graphs,  in  the  following 
way.  The  goal  order  is  initialized  to  the 
empty  sequence.  Iteratively,  it  is  checked  if 
there  exists  a  literal  that  is  not  member  of 
the  sequence,  such  that  each  literal  from 
which  it  can  be  reached  in  any  of  the  graphs, 
is  a  member  of  the  sequence.  If  that  is  the 
case,  then  the  literal  can  be  added  to  the  end 
of  the  sequence. 

3.  Given  a  goal  order,  a  solution  sequence  is 
produced  using  a  means-ends  analysis 
technique  [Nilsson  82,  p304].  The  main 
difference  between  our  algorithm  and  the 
original  algorithm  is  that  our  algorithm 
does  not  have  to  non-deterministically  select 
a  literal  in  the  goal,  and  that  no  domain 
knowledge  is  used  to  select  an  operator  to 
reduce  the  difference  between  the  current 
state  description  and  the  goal  state 
description.  The  algorithm  also  checks  that 
a  previously  added  literal  in  the  goal 
description  is  never  deleted. 
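Step 2 above amounts to repeatedly picking a literal whose predecessors have all been placed. A minimal sketch follows, assuming that the trees have already been flattened into a set of (earlier, later) pairs; the name `potential_goal_order` and this pair representation are assumptions of the sketch, not details given in the text.

```python
def potential_goal_order(literals, precedes):
    """Build a potential goal order.

    literals -- goal literals, in any order
    precedes -- set of (earlier, later) pairs collected from all trees
    A literal is appended once every literal that must precede it is
    already in the sequence. If the constraints are cyclic, the result
    is shorter than `literals`, which the caller should check.
    """
    order = []
    while True:
        ready = [l for l in literals
                 if l not in order
                 and all(a in order for a, b in precedes if b == l)]
        if not ready:
            return order
        order.append(ready[0])


# Six-block tower, with the goal literals listed in the least helpful order.
literals = ["on(a,b)", "on(b,c)", "on(c,d)", "on(d,e)", "on(e,f)"]
precedes = {("on(e,f)", "on(d,e)"), ("on(d,e)", "on(c,d)"),
            ("on(c,d)", "on(b,c)"), ("on(b,c)", "on(a,b)")}
print(potential_goal_order(literals, precedes))
```

Checking only direct predecessors is sufficient here, because a predecessor can itself enter the sequence only after its own predecessors have been placed.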

Example  1  cont. 

The goal is to build the tower in figure 5a using
the original set of STRIPS-operators
from the initial state where all blocks are clear
and on the table, given the generalized precedence
graph in figure 3b. All literals in the goal
state description can match the second argument
of the triple '(on(X, Y), on(Z, W), {X=W})'.
Hence, when applying step 1 in Find-solution, a
tree is constructed for each literal in the goal
state description, with the literal as a root. The
set of tree structured graphs is shown in figure
5b. The tree consists of a single literal when the
literal 'on(e, f)' is the root, since there does not
exist another literal in the goal state description
that matches the first argument such that
the equality holds.

Application of step 2 to the set of tree structured
graphs produces the goal order (on(e, f),
on(d, e), on(c, d), on(b, c), on(a, b)), since it is the
only possible goal order. The solution sequence
produced by step 3 is (pickup(e), stack(e, f),
pickup(d), stack(d, e), pickup(c), stack(c, d),
pickup(b), stack(b, c), pickup(a), stack(a, b)).
Figure 5b. Tree structured graphs created by step
1  in  Find-solution  for  the  tower  building 
problem. 
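The solution sequence of the six-block example can be replayed to confirm that the goal literals are reached in the stated order and that none of them is deleted afterwards. The sketch below assumes the usual Blocks World definitions of pickup and stack (preconditions, delete list, add list); the function names and the string encoding of literals are illustrative choices.

```python
def pickup(x):
    pre = {f"ontable({x})", f"clear({x})", "handempty"}
    return pre, pre, {f"holding({x})"}       # preconditions, deletes, adds


def stack(x, y):
    pre = {f"holding({x})", f"clear({y})"}
    return pre, pre, {"handempty", f"on({x},{y})", f"clear({x})"}


def run(initial_state, plan, goal):
    """Apply the plan; return the final state and the goal literals in
    the order in which they were added."""
    state, reached = set(initial_state), []
    for name, (pre, delete, add) in plan:
        if not pre <= state:
            raise ValueError(f"{name} is not applicable")
        if delete & set(reached):
            raise ValueError(f"{name} deletes a goal literal already reached")
        state = (state - delete) | add
        reached += sorted(l for l in add if l in goal and l not in reached)
    return state, reached


blocks = "abcdef"
initial = ({"handempty"}
           | {f"ontable({b})" for b in blocks}
           | {f"clear({b})" for b in blocks})
goal = {f"on({x},{y})" for x, y in zip(blocks, blocks[1:])}

plan = []
for x, y in reversed(list(zip(blocks, blocks[1:]))):
    plan += [(f"pickup({x})", pickup(x)), (f"stack({x},{y})", stack(x, y))]

state, reached = run(initial, plan, goal)
print(goal <= state)   # True
print(reached)
```

The list `reached` is the goal order realized by the plan, so comparing it with the potential goal order is a simple consistency check on step 3.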

4  Cohen’s  Algorithm 

We briefly present the algorithm of Cohen
[Cohen 88]. The algorithm is intended to be
applied to proofs made by a Horn clause theorem
prover. However, in this section we show
how the algorithm can also be applied to proofs
given in the STRIPS-formalism.

Cohen's  algorithm  learns  deterministic 
control  automata  from  example  proofs.  Each  arc 
of  a  control  automaton  is  labeled  with  an  input 
symbol  representing  a  set  of  rules,  and  an  output 
symbol  representing  a  single  rule.  The  control 
automaton  guides  a  proof  as  follows:  first  all 
applicable  rules  are  collected  into  a  set  S.  If 
there  is  an  arc  leading  from  the  current  state, 
with  input  symbol  S,  then  the  rule  corresponding 
to  the  output  symbol  is  applied.  The  new  current 
state  will  be  the  one  the  arc  was  leading  to.  If 
there  is  no  arc  with  input  symbol  S,  the  proof 
fails.  The  proof  succeeds  if  it  is  completed  using 
the  sequence  of  rules  output  by  the  control  auto¬ 
maton. 
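The way a control automaton guides a proof can be sketched as a small interpreter loop. Everything domain specific below is a hypothetical toy (a state is a pair of the number of blocks left to stack and whether a block is held); only the loop itself follows the description above, and it is not taken from Cohen's system.

```python
def run_automaton(arcs, start, state, applicable, apply_rule, done):
    """Follow a control automaton.

    arcs -- dict mapping (automaton_state, frozenset_of_applicable_rules)
            to (rule_to_apply, next_automaton_state)
    Returns the list of applied rules, or None if no arc matches.
    """
    q, applied = start, []
    while not done(state):
        symbol = frozenset(applicable(state))   # the input symbol S
        if (q, symbol) not in arcs:
            return None                         # the proof fails
        rule, q = arcs[(q, symbol)]             # output symbol, new state
        state = apply_rule(state, rule)
        applied.append(rule)
    return applied


# Toy domain: state = (blocks_left, holding).
def applicable(state):
    return ["stack"] if state[1] else ["pickup"]


def apply_rule(state, rule):
    left, _ = state
    return (left, True) if rule == "pickup" else (left - 1, False)


def done(state):
    return state == (0, False)


arcs = {
    (0, frozenset({"pickup"})): ("pickup", 1),
    (1, frozenset({"stack"})): ("stack", 0),
}
print(run_automaton(arcs, 0, (2, False), applicable, apply_rule, done))
print(run_automaton({}, 0, (2, False), applicable, apply_rule, done))
```

Note that the automaton only names the rule to apply; in a real Blocks World the interpreter would still have to choose which blocks to instantiate it with.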

Cohen's  algorithm  applied  to  a  proof  of  build¬ 
ing  the  tower  in  figure  1  produces  the  initial 
control  automaton  in  figure  6a.  The  next  step  in 
the  algorithm  checks  the  possibility  of  merging 
states  having  arcs  with  the  same  input/output 
symbols  (if  it  is  possible  to  make  the  automaton 
deterministic  after  merging).  All  possible 
mergings  are  made,  yielding  a  reduced  determi¬ 
nistic  control  automaton.  The  corresponding 
reduced  automaton  to  the  initial  control  auto¬ 
maton,  is  presented  in  figure  6b. 


Figure  6a.  Initial  control  automaton  for  the 
tower  building  example. 


Figure  6b.  Reduced  control  automaton  for  the 
tower  building  example. 


In Cohen's system ADEPT, backtracking is not
permitted. When the system has
learned to build towers and a new tower is to be
built, it will not succeed if the wrong
block is picked up first. Since the probability of
incorrectly choosing a block is very large, the
learned procedure will seldom (or in more
complex cases never) work. However, in our
comparison we permit the system to backtrack
when a dead-end is reached.

5  Results 

In  this  section  we  present  some  preliminary 
results  from  the  comparison  of  the  algorithm  N 
with  the  algorithm  of  Cohen.  Two  main  ques¬ 
tions  are  raised  in  this  section: 

a)  Which  of  the  two  algorithms  produces  the 
most  efficient  general  schema? 

b)  Are  there  any  examples  that  one  of  the  algo¬ 
rithms  does  learn  from,  but  the  other  does  not? 

The  efficiency  of  the  learning  algorithms  is  not 
considered  in  this  comparison. 

The size of the search tree when using a
schema produced by Cohen's algorithm is entirely
dependent on the number of possible operator
applications. When a generalized precedence
graph is used, the search tree depends on
the number of literals that match relations in
the generalized precedence graph and on the
search made by the domain independent means-
ends analysis algorithm given the goal order.

Since  these  numbers  may  vary  between 
domains  and  problems  it  is  difficult  to  give  a 
general  answer.  However,  we  have  made  a 
number  of  limited  experiments  comparing  the 
efficiency  of  the  general  schemata  produced  by 
the  algorithm  N  and  the  algorithm  of  Cohen. 
Figure  7  shows  the  growth  of  time  taken  to  build 
towers  of  different  height  with  schemata 
produced  by  Cohen's  algorithm  and  the  algo¬ 
rithm  N.  In  addition,  we  present  the  perfor¬ 
mance  of  a  domain  independent  means-ends 
analysis  system,  that  in  contrast  to  the  means- 
ends  analysis  system  used  when  interpreting 
generalized  precedence  graphs,  has  to  non- 
deterministically  select  a  literal  in  the  goal 
state  description.  All  problems  given  to  the 
systems  are  stated  as  worst-case  problems  (i.e. 
the  literals  in  the  initial  state  description  and 
goal  state  description  are  ordered  so  that  the 
appropriate  literals  are  chosen  last  when  inst¬ 
antiating  operators  and  selecting  subgoals 
respectively). The algorithms are implemented
in PROLOG on a Macintosh II. The learning case
is the problem of building a tower of 4 blocks.


[Plot: Tower building]

Figure 7. Results from the tower building experiment.

Two  examples  can  be  given  as  an  answer  to  the 
second  question  stated  above.  Cohen's  algo¬ 
rithm  is  able  to  learn  the  control  automaton  in 
figure  6b  from  a  proof  of  building  a  tower  with 
three  blocks.  The  algorithm  N  needs  at  least 
three  literals  in  the  goal  state  description  to  be 
able  to  generalize,  and  thus  cannot  generalize  a 


proof  of  building  a  tower  with  three  blocks  (two 
literals).  On  the  other  hand,  there  are  many 
classes  of  problems  that  the  algorithm  N  can 
learn  schemata  for  and  the  algorithm  of  Cohen 
cannot.  In  figure  8  we  show  an  initial  control 
automaton  for  building  an  arch,  that  cannot  be 
reduced  and  made  deterministic.  The  algorithm 
N  is  able  to  generalize  the  corresponding  prece¬ 
dence  graph  (as  was  shown  in  example  2)  and 
thus  can  learn  from  an  example  problem  that 
Cohen's  algorithm  cannot. 


Figure 8. An irreducible initial control automaton.


6  Conclusions 

We  have  presented  algorithm  N,  a  new 
approach  to  generalizing  number  that  can  be 
used  for  planning  problems  which  are  stated  in 
the  STRIPS-formalism.  Instead  of  generalizing  a 
single  proof  in  terms  of  operator  applications, 
our  algorithm  generalizes  all  solutions  to  the 
example  problem,  by  generalizing  the  order  in 
which  the  literals  in  the  goal  state  description 
can be reached, yielding a generalized precedence
graph.

The  algorithm  N  is  compared  to  an  algorithm 
for  generalizing  number  that  generalizes  proofs 
in  terms  of  rule  applications  [Cohen  88]. 
Experiments  have  shown  that  it  takes  less  time 
to  find  a  solution  sequence  using  a  generalized 
precedence  graph  than  finding  a  proof  with  a 
control  automaton  in  a  number  of  cases. 

The  algorithm  N  is  shown  to  be  able  to  handle 
a  class  of  problems  that  the  algorithm  of  Cohen 
cannot. These problems are such that it is not
possible  to  make  deterministic  reduced  auto¬ 
mata  that  correspond  to  the  example  proofs.  On 
the  other  hand,  we  have  shown  how  Cohen's 
algorithm  can  learn  from  a  smaller  example 
than  the  algorithm  N. 

The algorithm N can be improved in a
number of respects. The general
heuristic used in the algorithm for finding repetition
in a precedence graph may need to be revised,
since there exist problems that cannot be generalized
appropriately (see section 2.3).

Another direction is to develop
an algorithm that can combine schemata learned
from different examples when solving a
problem. Up to now, new problems have been
solved using a single schema only. Another
question is how to use the generalized precedence
graphs to solve problems that are subproblems
of the problems that the graphs represent.
Abstract

One of the difficult problems in the area of
explanation based learning is the utility problem;
learning too many rules of low utility
can lead to swamping, or degradation
of performance. This paper introduces two
new techniques for improving the utility of
learned rules. The first technique is to combine
EBL with inductive learning techniques
to learn a better set of control rules; the second
technique is to use these inductive techniques
to learn approximate control rules.
The two techniques are synthesized in an algorithm
called approximating abductive explanation
based learning (AxA-EBL). AxA-EBL
is shown to improve substantially over
standard EBL in several domains.

1  Introduction 

One of the difficult problems in the area of explanation
based  learning  is  the  utility  problem.  The  utility  of  a 
rule  is  its  contribution  to  performance  improvement; 
the  utility  is  directly  proportional  to  the  coverage  of 
a  rule  and  inversely  proportional  to  the  match  cost  of 
a  rule,  where  coverage  is  defined  as  the  percentage  of 
the  time  that  the  rule  is  used  in  problem  solving,  and 
match  cost  is  simply  the  time  needed  to  determine  if  a 
rule  is  applicable. 
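
The definition above can be made concrete with a minimal sketch. It assumes utility is taken to be exactly the quotient coverage / match cost (the text only states proportionality), and the `RuleStats` record and the two example rules are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleStats:
    """Hypothetical bookkeeping record for one learned rule."""
    name: str
    coverage: float    # fraction of problem-solving episodes in which the rule is used
    match_cost: float  # seconds needed to decide whether the rule is applicable


def utility(rule: RuleStats) -> float:
    """Utility up to a constant factor: coverage divided by match cost (assumed form)."""
    if rule.match_cost <= 0:
        raise ValueError("match cost must be positive")
    return rule.coverage / rule.match_cost


rules = [
    RuleStats("general_but_slow", coverage=0.60, match_cost=0.30),
    RuleStats("narrow_but_fast", coverage=0.20, match_cost=0.02),
]
# Rank rules from highest to lowest utility.
ranked = sorted(rules, key=utility, reverse=True)
```

Under this reading, a rule with modest coverage can still rank first if it is very cheap to match, which is the intuition behind the approximate rules introduced later.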

The  utility  problem  often  arises  in  learning  rules 
intended to improve the performance of a problem
solver.  Learning  too  many  rules  of  low  utility  can 
lead  to  swamping,  or  degradation  of  performance: 
in  the  worst  case,  a  problem  solver  can  be  slower 
after  learning  than  before  learning.  Previous  re¬ 
search  on  the  utility  problem  has  focused  on  detect¬ 
ing  and  discarding  rules  of  low  utility  [Minton,  1988; 
Markovitch  and  Scott,  1989],  lowering  the  match  cost 
of  rules  using  partial  evaluation  or  other  simplification 
techniques  [Prieditis  and  Mostow,  1987;  Minton,  1988; 
Tambe  and  Rosenbloom,  1989],  and  constraining  the 
use  of  learned  rules  [Mooney,  1989]. 

This  paper  introduces  two  new  techniques  for  im¬ 
proving  the  utility  of  learned  rules.  The  first  technique 
is  to  combine  EBL  with  inductive  learning  techniques 


to  learn  a  better  set  of  control  rules.  The  second  tech¬ 
nique  is  to  use  these  inductive  techniques  to  learn  ap¬ 
proximate  control  rules:  that  is,  to  learn  rules  which 
are  approximations  of  the  control  rules  which  would 
be  learned  by  standard  EBL  techniques.  Although  use 
of  inductive  techniques  alone  does  not  always  lead  to 
rules  of  higher  utility,  the  combination  of  these  tech¬ 
niques  is  shown  to  substantially  improve  the  perfor¬ 
mance  of  EBL  across  a  variety  of  domains. 

Combining  EBL  with  inductive  learning.  Flann  and 
Dietterich have noted that in the process of learning
control  rules,  many  EBL  systems  make  inductive  leaps. 
An  example  given  in  [Flann  and  Dietterich,  1989]  is 
a  rule  learned  by  LEX2  in  the  symbolic  integration 
domain: 

If the current problem matches
$\int c\,x^{r}\,dx \quad (r \neq -1)$

Then use the operator
$\int c\,f(x)\,dx \;\rightarrow\; c \int f(x)\,dx$

Forming this rule is an inductive leap; this is witnessed
by the fact that the rule recommends the wrong action
on a problem of the form $\int 0 \cdot x^{r}\,dx$ (multiplying out the zero to
get $\int 0\,dx$ leads to a shorter solution). Another case in
which inductive decisions are made is when only one of
several possible control rules is learned; for example,
PRODIGY might learn either a macro-operator or a
set of operator preference rules from a trace [Minton,
1988].

However,  the  standard  implementation  of  EBL, 
in  which  control  rules  are  learned  incrementally  as 
needed,  is  relatively  slow  as  an  inductive  learner.  Com¬ 
bining  EBL  with  more  powerful  inductive  learning 
techniques can improve the rate at which learning converges
to an adequate set of control rules. A crucial
point is that on average, improving the rate of convergence
will also improve the coverage of the individual
rules learned. This is true because in order to improve
the learning rate, it is necessary to find fewer rules with
better coverage.¹


¹It must be emphasized that we are not advocating use
of pure inductive learning techniques (e.g., version spaces)
to  learn  control  rules.  Explanation-based  techniques  are 
clearly  more  appropriate  in  learning  control  rules  because 
there  is  a  strong  theory. 
Using approximate rules. The first technique alone,
however, is not enough to avoid swamping. The problem
is that in learning control rules, one is trying to
maximize the utility of the set of rules. To do this,
one must simultaneously maximize total coverage and
minimize total match cost. Standard inductive learning
techniques, however, maximize coverage without
regard to match cost.

Viewed  in  this  light,  learning  useful  control  rules  is 
a  multiple-resource  optimization  problem,  in  which  the 
two  antagonistic  resources  of  coverage  and  match  cost 
must  be  simultaneously  optimized.  A  standard  tech¬ 
nique  for  solving  such  problems  is  to  set  an  artificial 
constraint on one resource, and optimize the other resource
subject to this constraint. This technique can
be  applied  to  the  control  rule  learning  problem  by  con¬ 
straining  the  inductive  learner  to  only  consider  control 
rules  with  match  cost  below  a  certain  fixed  cutoff. 

Unfortunately,  the  control  rules  learned  by  expla¬ 
nation  based  techniques  are  almost  always  very  com¬ 
plex;  hence  if  a  reasonable  match-cost  cutoff  were  cho¬ 
sen,  very  few  rules  would  be  available  for  the  induc¬ 
tive  learning  method  to  consider.  This  problem  can 
be  remedied  by  allowing  the  learner  to  also  consider 
approximations to expensive control rules.

In  the  remainder  of  the  paper,  we  first  discuss  the 
control  rule  learning  problem  used  to  test  these  ideas, 
and  then  the  learning  algorithm  used.  We  then  de¬ 
scribe  the  domains  used  for  experimentation.  Finally, 
we  present  and  interpret  our  experimental  results  and 
draw  some  conclusions. 

2  The  learning  problem 

To evaluate the learning algorithms discussed in this
paper,  we  used  as  a  testbed  the  problem  of  learning 
clause  selection  rules  for  Prolog  programs.  This  learn¬ 
ing  problem  has  several  advantages. 

•  Commercial Prologs are highly optimized, and
hence the initial problem solver has a fairly low
overhead. This encourages careful experimentation,
and also means that comparisons of performance
before and after learning are a reasonable
test of the overall effectiveness of learning.

•  Many  different  problem  solving  strategies  (for  ex¬ 
ample,  state  space  search,  problem  decomposi¬ 
tion,  means-end  analysis,  etc.)  can  be  easily 
coded  as  Prolog  programs  in  such  a  way  that 
learning  clause  selection  rules  significantly  im- 

Our  learning  testbed  consists  of  two  routines.  The 
first  is  a  critic  which  analyzes  a  set  of  Prolog  traces  and 
extracts  examples  of  correct  and  incorrect  clause  selec¬ 
tion  decisions.  A  correct  decision  is  any  decision  which 
leads  to  a  solution;  an  incorrect  decision  is  any  decision 
which  is  eventually  retracted  by  Prolog’s  backtracking 
strategy.  Along  with  each  correct  decision,  a  proof  of 
its  correctness  is  recorded.  The  theory  used  to  derive 
this  proof  is  very  simple:  it  merely  contains  the  clauses 
of  the  initial  program  as  axioms,  and  also  an  axiom  of 


the  form 

$\mathit{useful}(c_i, A)$

for each clause $c_i = (A \leftarrow B_1, \ldots, B_q)$ in the initial
program. This “usefulness theory” is an extension to
the problem of Prolog clause selection of the theory
used in the LEX2 system [Mitchell, 1982] for operator
selection.

The  second  routine  is  a  control  rule  compiler  which, 
given  a  set  of  control  rules  and  an  initial  program, 
generates  a  new  Prolog  program  which  incorporates 
the  learned  control  rules.  The  routine  takes  as  input 
a  set  of  control  rules  of  the  form 

$\mathit{useful}(c_i, A) \leftarrow U(A)$

which can be read as saying “if $U(A)$ is true, then it is
useful to apply clause $c_i$ to the goal $A$”. $U(A)$ is typically
a complex conjunctive condition involving the
variables of the goal $A$. For each such rule, the postprocessor
constructs a copy of clause $c_i$ to which $U(A)$
has been added as an additional “filtering condition”;
in other words, if $c_i = (A \leftarrow B_1, \ldots, B_q)$, then a clause
of the form

$A \leftarrow U(A), !, B_1, \ldots, B_q$

is  constructed.  Notice  that  the  cut  (!)  makes  selection 
of  this  clause  a  committed  choice:  no  backtracking  will 
be  done  if  this  choice  is  incorrect. 
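
The clause construction just described can be sketched as a small text-level transformation. This is only an illustration, not the paper's compiler: the function name `compile_control_rule`, the `path/2` clause, and the `same_row/2` filtering condition are all hypothetical, and clauses are handled as plain strings rather than parsed Prolog terms.

```python
from __future__ import annotations


def compile_control_rule(head: str, body: list[str], condition: str) -> str:
    """Build the filtered copy of a clause: head :- condition, !, body.

    `condition` plays the role of U(A); the cut after it makes selection
    of this clause a committed choice.
    """
    goals = [condition, "!"] + list(body)
    return f"{head} :- {', '.join(goals)}."


# Hypothetical clause  path(X, Y) :- edge(X, Z), path(Z, Y).
# with a hypothetical learned condition same_row(X, Y) on the goal's variables.
clause = compile_control_rule(
    head="path(X, Y)",
    body=["edge(X, Z)", "path(Z, Y)"],
    condition="same_row(X, Y)",
)
```

Here `clause` is the string `path(X, Y) :- same_row(X, Y), !, edge(X, Z), path(Z, Y).`; because the condition is an ordinary Prolog goal, matching it is left to Prolog's own unification.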

One  advantage  of  incorporating  control  rules  in  this 
way  is  that  matching  is  done  by  Prolog’s  unification 
procedure;  directly  using  the  implementation  language 
reduces  the  overhead  of  pattern  matching. 

If there is more than one control rule for $c_i$, then
multiple copies of $c_i$ are generated: i.e., multiple control
rules are interpreted disjunctively. A copy of the
original  clause  with  no  filtering  condition  is  retained 
if  and  only  if  some  decisions  to  use  that  clause  are 
not  explained  by  any  control  rule.  Clauses  in  the  new 
program  are  ordered  first  according  to  the  order  of  the 
clauses  in  the  initial  program  to  which  they  correspond 
(since  this  ordering  may  represent  important  control 
information),  and  then  according  to  the  ordering  of 
control  rules. 

No  simplification  or  compression  is  done. 

Retaining  the  original  clauses  when  some  control  de¬ 
cisions  are  unexplained  means  that  some  performance 
improvement  can  be  gained  even  if  control  rules  can 
only  be  learned  for  some  clause  selection  decisions. 
This  makes  the  learning  system  less  brittle;  however, 
it  also  means  that  some  backtracking  may  still  be  done 
in  the  learned  program.  This  is  the  motive  for  discard¬ 
ing  the  original  clauses  when  all  control  decisions  have 
been  explained,  and  requiring  that  learned  control  de¬ 
cisions  are  not  backtracked  over;  otherwise,  in  back¬ 
tracking,  all  of  the  solutions  considered  by  the  original 
program  would  still  have  to  be  examined  and  rejected, 
limiting  the  degree  to  which  performance  can  be  im¬ 
proved.  However,  discarding  clauses  also  means  that 
the  learned  program  may  fail  on  problems  which  are 
solvable  by  the  initial  program.  When  this  happens 
the  original  program  is  invoked  on  the  top-level  goal. 
Thus, in spite of using approximate control rules and
committed choice control decisions, the learned program
always solves at least as many problems as the
original program did.
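
The fallback behaviour can be sketched as a thin wrapper. This is a minimal illustration under stated assumptions: both solvers are hypothetical callables that return `None` on failure, and the toy "problems" are integers.

```python
def solve_with_fallback(goal, learned, original):
    """Try the learned program first; invoke the original on the top-level goal if it fails.

    Returns (answer, solved_directly). The flag is what a coverage statistic
    would count: True when no recourse to the original program was needed.
    """
    answer = learned(goal)
    if answer is not None:
        return answer, True
    return original(goal), False


# Hypothetical solvers: the learned one is fast but gives up on odd inputs.
def original(goal):
    return goal * 2


def learned(goal):
    return goal * 2 if goal % 2 == 0 else None


results = [solve_with_fallback(g, learned, original) for g in (4, 3, 10)]
coverage = sum(direct for _, direct in results) / len(results)
```

Every goal the original solves is still solved, at the price of one wasted attempt whenever the learned program fails.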

3  The  learning  algorithms 

Three learning algorithms were experimentally compared.
The first two are strawmen, corresponding to
a  conventional  EBL  control-rule  learning  system,  and 
to  an  extended  EBL  learning  system  which  uses  in¬ 
ductive  learning  techniques  to  improve  coverage,  but 
which  ignores  match  cost.  The  third  algorithm  uses 
inductive  techniques  to  search  a  space  of  approximate 
rules,  and  is  the  main  focus  of  this  research. 

One  important  difference  between  these  algorithms 
and  conventional  control-rule  learning  algorithms  is 
that  they  are  non-incremental.  Each  learning  algo¬ 
rithm  takes  as  input  a  set  of  examples  of  correct  and 
incorrect  control  decisions,  which  are  generated  by  ap¬ 
plying  the  critic  to  a  large  set  of  traces. 

Standard EBL (EBL): Standard EBL starts with
an empty list of control rules, and examines each correct
control decision in the order in which they were
made. If there are no existing control rules which account
for the decision, then EBG is applied to the
proof of the correctness of the control decision, and
the resulting rule is added to the end of the list of
learned control rules.²

Standard EBL is a batch simulation of a SOAR-like
incremental learning system which uses a learned
clause  selection  rule  if  one  is  available,  and  which  oth¬ 
erwise  first  uses  search  to  determine  the  right  clause, 
then forms a clause selection rule which summarizes
the  results  of  the  search  process. 
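
The batch loop of standard EBL can be sketched as follows. This is a toy rendering, not the paper's implementation: decisions and rules are both modelled as hypothetical `(clause, feature set)` pairs, `generalize` merely stands in for applying EBG to a proof, and a rule is taken to explain a decision when its conditions are a subset of the decision's features.

```python
def standard_ebl(decisions, explains, generalize):
    """Learn rules one decision at a time, in the order the decisions were made."""
    rules = []
    for decision in decisions:
        # Only learn when no existing rule already accounts for the decision.
        if not any(explains(rule, decision) for rule in rules):
            rules.append(generalize(decision))  # new rule goes at the end of the list
    return rules


def explains(rule, decision):
    """Toy matcher: same clause, and every rule condition holds in the decision."""
    return rule[0] == decision[0] and rule[1] <= decision[1]


def generalize(decision):
    """Toy stand-in for EBG: keep the decision's clause and features as the rule."""
    return (decision[0], frozenset(decision[1]))


decisions = [
    ("c1", frozenset("ab")),
    ("c1", frozenset("abc")),  # already explained by the rule learned from the first
    ("c2", frozenset("a")),
]
learned_rules = standard_ebl(decisions, explains, generalize)
```

The order dependence is visible here: presenting the second decision first would produce a more specific first rule and one extra rule overall.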

Abductive  EBL  (A-EBL):  Abductive  EBL  also 
starts  with  an  empty  list  of  control  rules,  but  repeat¬ 
edly  adds  to  the  end  of  the  list  the  control  rule  R 
which a) is consistent, i.e., does not explain any incorrect
control decisions, and b) maximizes the ratio
of the number of unexplained decisions explained by
R to the size of R.³ R is chosen from a candidate
pool of rules which consists of all rules which could be
generated by applying EBG to some correct control
decision.
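
The selection loop of A-EBL can be sketched as a greedy cover. The sketch makes several assumptions beyond the text: rules and decisions are hypothetical `(clause, feature set)` pairs, the candidate pool is passed in explicitly instead of being produced by EBG, the size of a rule is approximated by its number of conditions, and the loop stops when no candidate explains anything new.

```python
def a_ebl(correct, incorrect, candidates, explains, size):
    """Greedily pick consistent rules maximizing (newly explained decisions) / size."""
    # a) consistency: discard candidates that explain any incorrect decision.
    pool = [r for r in candidates if not any(explains(r, d) for d in incorrect)]
    rules, unexplained = [], list(correct)
    while unexplained:
        def score(rule):
            return sum(explains(rule, d) for d in unexplained) / size(rule)

        # b) best ratio of newly explained decisions to rule size.
        best = max(pool, key=score, default=None)
        if best is None or score(best) == 0:
            break  # nothing left that explains a remaining decision
        rules.append(best)
        unexplained = [d for d in unexplained if not explains(best, d)]
    return rules


def explains(rule, decision):
    """Toy matcher: same clause, and every rule condition holds in the decision."""
    return rule[0] == decision[0] and rule[1] <= decision[1]


correct = [("c1", frozenset("ab")), ("c1", frozenset("abc")), ("c1", frozenset("de"))]
incorrect = [("c1", frozenset("b"))]
candidates = [("c1", frozenset("b")), ("c1", frozenset("ab")), ("c1", frozenset("de"))]
chosen = a_ebl(correct, incorrect, candidates, explains, size=lambda r: len(r[1]))
```

In this toy run the one-condition candidate is rejected as inconsistent, and the two remaining rules are chosen in order of their explained-per-size ratio.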

The  main  loop  in  the  A-EBL  algorithm  implements 
a  greedy  set  cover  of  the  control  decisions;  in  other 
words,  A-EBL  looks  for  a  minimal-size  set  of  control 
rules  to  explain  all  the  control  decisions.  The  greedy 
set cover technique has also been used in Haussler’s
algorithm for learning disjunctive boolean formulae.
A-EBL has been used with some success on inductive
learning tasks [Cohen, 1989; Cohen, 1990]. Its
main advantage over standard EBL techniques for such
tasks is that it can be used even on an “abductive”
domain theory, one which generates multiple inconsistent
explanations. A-EBL has been shown to satisfy
Valiant’s criterion of pac-learnability, and to have

²Experimental studies [Shavlik, 1987] suggest that this
ordering  is  most  beneficial. 

³The size of a rule is defined to be the number of nodes
in the explanation structure used to form the rule.


near-optimal  sample  complexity  [Cohen,  1989]. 

Approximating  abductive  EBL  (AxA-EBL): 
Approximating  abductive  EBL  works  exactly  like  A- 
EBL,  except  that  the  candidate  pool  consists  of  all 
k-bounded approximations to a rule generated by applying
EBG to some correct control decision. A k-bounded
approximation is formed by dropping all but
j conditions from the body of the rule, for some j < k.
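
Enumerating the approximations of a single rule can be sketched as below. The bound on j is an assumption: the extracted text reads "j < k", but the sketch keeps between 1 and k conditions, so adjust the range if the strict reading is intended. The condition names are hypothetical.

```python
from itertools import combinations


def k_bounded_approximations(conditions, k):
    """Return every way of keeping j of the rule's conditions, for 1 <= j <= k (assumed)."""
    conds = list(conditions)
    approximations = []
    for j in range(1, min(k, len(conds)) + 1):
        # combinations() preserves the original order of the kept conditions.
        approximations.extend(combinations(conds, j))
    return approximations


# A hypothetical EBG rule body with four operational conditions.
body = ["at(robot, R)", "in_room(Box, R)", "door(R, R2)", "open(D)"]
approximations = k_bounded_approximations(body, k=2)
```

For a body of m conditions the pool contributed by one rule has on the order of m^k members, which is one way to see why only small values of k are practical.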

A k-bounded approximation rule will have low
match cost, unless the operational conditions are expensive
to test. For example, if the truth of operational
conditions is tested by database lookup in
a database of n facts, then the match cost of a k-bounded
approximation is $O(n^k)$.

AxA-EBL  is  more  computationally  expensive  than 
EBL; it runs in time polynomial in the total size of the
proofs  of  the  control  decisions,  but  exponential  in  k. 
Hence,  only  small  values  of  k  can  be  used.  In  the  cur¬ 
rent  implementation,  two  techniques  are  used  to  prune 
the  search  space  of  approximate  rules:  Prolog  mode 
declarations are used to eliminate ill-formed approximations,
and an admissible heuristic search is used to
find  the  optimal  rule  from  the  candidate  pool.  These 
tricks  substantially  reduce  learning  time;  the  current 
implementation  of  AxA-EBL  takes  an  average  of  15.5 
CPU  minutes  on  a  Sun/4  to  process  120  traces  from 
the  STRIPS  robot-world  domain  described  below.  (It 
takes  about  20  minutes  to  solve  the  120  problems  using 
the  original  means-ends  analysis  planner.) 

A-EBL  is  also  computationally  expensive,  but  for  a 
different  reason:  to  find  the  optimal  rule  in  the  candi¬ 
date pool, each candidate must be tested against every
control  decision.  This  is  very  expensive  if  rules  have  a 
high  match  cost. 

Standard EBL, A-EBL, and AxA-EBL share two
properties  which  appear  to  be  crucial  for  success  on 
this  learning  task.  First,  all  three  algorithms  can  be 
constrained  to  produce  rules  which  contain  only  oper¬ 
ational  features.  Second,  like  standard  EBL,  both  A- 
EBL  and  AxA-EBL  usually  learn  several  control  rules 
which  determine  when  a  particular  clause  should  be 
selected:  in  other  words,  the  concept  of  “usefulness” 
for  each  clause  can  be  disjunctively  defined.  The  latter 
property  is  important  because  many  Prolog  programs 
contain  clauses  which  are  useful  in  several  different 
situations.  Usually,  several  control  rules,  interpreted 
disjunctively,  will  be  needed  to  determine  when  such 
a  clause  is  useful. 

A-EBL  and  AxA-EBL  differ  in  these  respects  from 
other  inductive  extensions  to  EBL,  in  particular 
mEBG and IOE [Flann and Dietterich, 1989].

4  Experimental  results 

4.1  Description  of  the  domains 

Experimentation has been done with these three
learning  algorithms  in  several  domains.  The  domains 
are  summarized  in  Table  1. 

μLEX: A simplified version of symbolic integration
using  state-space  search  with  iterative  deepening.  The 
training  and  test  problem  sets  and  the  operator  set  are 
Table 1: Summary of domains used in experiments

Domain  Description           Search Strategy  Number of Operators  Training Set Sizes  Test Set Size
μLEX    simplified LEX        state-space      36                   4,8,12,16,19        [illegible]
RW      STRIPS robot world    means-ends       10                   [missing]           [missing]
BW      Nilsson blocks world  means-ends       4                    [missing]           [missing]
GRID    graph search          depth-first      3                    5,10,15,25          [missing]

taken  from  [Keller,  1987]. 

RW: The STRIPS robot world of [Fikes et al., 1972],
with  a  means-ends  analysis  planner.  Problems  for  the 
training  and  test  set  are  generated  by  selecting  a  floor- 
plan  randomly  from  the  three  given  in  [Minton,  1988], 
randomly  placing  3  blocks  and  the  robot,  and  then  tak¬ 
ing  a  random  walk  of  bounded  length  in  state  space. 

BW:  The  blocks  world  of  [Nilsson,  1987],  with  the 
identical means-ends analysis planner. Problems are
generated by constructing an initial scene with 20 randomly
placed blocks, and then taking a random walk
of  bounded  length  in  state  space. 

GRID:  Depth-bounded  depth-first  search  of  an  ar¬ 
tificial  graph.  The  graph  is  a  15x15  grid,  with  75 
randomly  placed  obstacles,  and  5  randomly  placed 
“cities”, which are tightly connected by a network of
five  “interstate  highways”.  There  are  three  kinds  of 
operationally  different  connections  in  the  graph,  cor¬ 
responding  to  increasing  or  decreasing  the  value  of  a 
coordinate by 1 or following an “interstate”. Problems
are  selected  by  randomly  picking  start  and  end  points, 
and  asking  for  a  path  between  the  points. 
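
A depth-bounded depth-first search over such a grid can be sketched as follows. The sketch is deliberately simplified and is not the paper's Prolog program: it uses a tiny 3x3 grid, hypothetical obstacle positions, and only the unit moves that change one coordinate by 1 (the "interstate" connections are omitted).

```python
def neighbours(cell, size, obstacles):
    """Yield cells reachable by changing one coordinate by 1, avoiding obstacles."""
    x, y = cell
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nxt = (x + dx, y + dy)
        if 0 <= nxt[0] < size and 0 <= nxt[1] < size and nxt not in obstacles:
            yield nxt


def bounded_dfs(start, goal, size, obstacles, bound, path=None):
    """Return a path from start to goal using at most `bound` moves, or None."""
    path = path or [start]
    if start == goal:
        return path
    if bound == 0:
        return None
    for nxt in neighbours(start, size, obstacles):
        if nxt in path:  # do not revisit cells on the current path
            continue
        found = bounded_dfs(nxt, goal, size, obstacles, bound - 1, path + [nxt])
        if found is not None:
            return found
    return None


obstacles = {(1, 0), (1, 1)}
route = bounded_dfs((0, 0), (2, 0), size=3, obstacles=obstacles, bound=6)
too_short = bounded_dfs((0, 0), (2, 0), size=3, obstacles=obstacles, bound=5)
```

In this layout the only simple route around the obstacles takes six moves, so the search succeeds with a bound of 6 and fails with a bound of 5.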

BW  and  RW  problems  can  be  quite  difficult,  so  the 
planner includes two resource bounds: first, a bound
on the maximal length of the plan, and second, a time
bound.  The  time  bound  was  set  at  60  seconds  for  these 
experiments:  almost  all  of  the  blocks  world  problems 
and  about  6/7  of  the  robot  world  problems  were  solv¬ 
able  in  this  time  bound. 

The  domains  are  listed  roughly  in  order  of  their  diffi¬ 
culty  for  standard  EBL  techniques.  For  instance,  sym¬ 
bolic  integration  is  well  suited  to  use  of  EBL  for  opera¬ 
tor  selection;  however,  unconstrained  use  of  EBL  in  the 
blocks world domain can lead to swamping [Minton,
1988;  Mooney,  1989].  Graph  searching  is  a  very  diffi¬ 
cult  problem  for  standard  EBL  techniques  because  the 
match  cost  of  learned  rules  is  very  high,  and  the  cost 
of  problem  solving  without  any  learned  rules  is  low; 
matching the rules learned by EBL is equivalent to
solving the NP-complete subgraph isomorphism problem
[Minton, 1985], whereas depth-first search only
requires time linear in the size of the graph.

4.2  Comparison  of  EBL  and  AxA-EBL 

A  series  of  experiments  were  designed  to  determine 
how  the  performance  of  a  learned  program  varied  as 
a  function  of  the  number  of  training  examples  used  in 
learning.  In  each  experiment,  first  a  test  set,  of  the 
size  indicated  in  table  1,  was  selected.  In  the  case  of 
BW,  RW  and  GRID,  the  test  problems  were  randomly 


selected; in μLEX, the test problems were the designated
test problems in [Keller, 1987] (problem sets AT
and BT). Then a training set was selected, again randomly
for BW, RW and GRID, and from the designated
training problems (problem sets A and B) for
μLEX. Progressively larger subsets of the training set
were  then  given  to  each  learning  system,  again  as  in¬ 
dicated  in  table  1,  and  the  control  rules  learned  from 
each  subset  were  applied  to  the  problems  in  the  test 
set.  Two  statistics  were  kept.  First,  the  total  time  re¬ 
quired  to  solve  all  of  the  test  problems  was  recorded. 
The  second  statistic  kept  was  the  coverage  of  the  set  of 
control rules: the percentage of the test problems that
the learned program solved directly, without recourse
to the original program.⁴ These experiments were repeated
ten times, and the results were averaged. In
BW, RW and GRID, each trial used a different randomly
selected training set; in μLEX, each trial used
a different ordering of the fixed training set.

Approximations in the domains were chosen by incrementing
k until either some performance improvement
occurred or the cost of running AxA-EBL
became excessive. One of the surprises of these experiments
was the expressiveness of a language of very
short approximations: in all of the domains except
GRID, substantial performance improvement occurred
using AxA-EBL with k < 2.

The results of these experiments are summarized in
table 2. The crucial column of table 2 is the last one,
which  shows  the  ratio  of  the  time  spent  by  the  program 
learned  by  AxA-EBL  in  solving  the  problems  of  the 
test  set  to  the  time  spent  by  the  program  learned  by 
EBL.  AxA-EBL  is  marginally  slower  than  EBL  on  the 
μLEX domain, and substantially faster on the remaining
domains; overall, the programs learned by AxA-EBL
are about twice as fast as the programs learned
by EBL.
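
The last column of table 2 can be rechecked from the two time columns. The figures below are the EBL and AxA-EBL averages as read from the flattened table in this copy; the assignment of values to columns is a reconstruction, and the dictionary keys are just labels.

```python
# Average time on test problems, as read from table 2 (column assignment reconstructed).
ebl_time = {"muLEX": 0.103, "RW": 2.768, "BW": 2.553, "GRID": 0.361}
axa_ebl_time = {"muLEX": 0.104, "RW": 1.596, "BW": 0.987, "GRID": 0.268}

# Ratio AxA-EBL / EBL per domain, rounded to two decimals as in the table.
ratios = {d: round(axa_ebl_time[d] / ebl_time[d], 2) for d in ebl_time}
```

The recomputed ratios are 1.01, 0.58, 0.39 and 0.74, which agrees with the ratio column as it survives in the table and supports the reading that AxA-EBL is marginally slower on μLEX and faster elsewhere.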

The results of the experiment are given in more detail
in Figure 1 and Figure 2. These figures graph the
average time spent by a learned program, and also the
coverage of the learned control rules (the percentage
of the time that it was not necessary to go back
and use the original program to solve a problem) as a
function of the number of training examples given.

The  three  learning  techniques  behaved  essentially 

⁴Recall that the program which incorporates the learned
control  rules  may  fail  if  not  enough  control  rules  have  been 
learned,  or  if  some  of  the  control  rules  are  incorrect.  If  the 
program  fails,  then  the  original  problem  solver  is  invoked 
on  the  problem. 
Figure 2: Performance and Coverage Curves for RW and BW
Table 2: Summary of experimental results

         Average Time on Test Problems
Domain   Nolearn      EBL    A-EBL        AxA-EBL  AxA-EBL / EBL
μLEX     2.919        0.103  0.103        0.104    1.01
RW       9.956        2.768  0.922        1.596    0.58
BW       3.150        2.553  4.774        0.987    0.39
GRID     0.137        0.361  0.423        0.268    0.74
Average  [illegible]  1.447  [illegible]  0.739    [illegible]

identically on the μLEX domain. This domain is not
very  informative  for  the  purpose  of  comparing  the 
techniques:  it  is  included  in  this  paper  as  an  addi¬ 
tional  demonstration  of  the  generality  of  AxA-EBL. 

In  the  GRID  domain,  AxA-EBL  starts  out  with  a 
very  bad  set  of  control  rules  —  one  which  usually  leads 
the  problem  solver  on  a  wild  goose  chase  across  the 
graph  —  but  then  quickly  recovers.  In  all  ten  tri¬ 
als,  AxA-EBL  converges  to  the  same  program;  the 
programs  learned  by  EBL  and  A-EBL  are  still  get¬ 
ting  progressively  slower.  Unfortunately,  the  program 
learned  by  AxA-EBL  is  still  slower  than  the  original 
program.  This  indicates  that  some  additional  utility 
analysis  may  be  necessary,  even  when  approximations 
are  used. 

In  the  RW  and  BW  domains,  AxA-EBL  outperforms 
EBL by wide margins. Surprisingly, A-EBL does somewhat
better than AxA-EBL in the RW domain. We
interpret  this  result  as  follows.  The  coverage  graphs 
show  that  although  AxA-EBL  generates  control  rules 
which have lower match cost than A-EBL,⁵ AxA-EBL
converges  somewhat  more  slowly;  this  is  probably  the 
price  of  searching  through  a  much  larger  space  of  pos¬ 
sible  control  rules.  In  short,  AxA-EBL  trades  off  con¬ 
vergence  speed  for  lower  match  cost  in  the  planning 
domains.  This  unfortunately  leads  to  a  performance 
tradeoff:  if  matching  rules  is  cheap  and  problem  solv¬ 
ing  is  expensive,  as  in  the  RW  domain,  then  faster 
convergence  will  offset  the  poorer  quality  of  the  rules, 
and  A-EBL  will  produce  faster  programs. 

5  Related  work 

Use  of  approximations  in  learning  control  knowledge 
was investigated in the MetaLEX project [Keller,
1987].  The  techniques  used  in  MetaLEX,  however, 
were  specific  to  state-space  search,  and  required  pe¬ 
riodic  testing  of  programs  including  learned  control 

⁵This is obvious in the blocks world domain, since AxA-EBL's
control rules outperform A-EBL's control rules even
though  their  coverage  is  usually  worse.  It  is  a  little  more 
difficult  to  verify  in  the  RW  domain.  One  way  to  con¬ 
trol  for  the  effect  of  coverage  is  to  compare  only  sets  of 
control  rules  with  equal  coverage;  this  comparison  shows 
that  AxA-EBL  control  rules  are  about  12%  faster  than 
A-EBL  control  rules  with  equivalent  coverage.  Another  in¬ 
dication  that  the  AxA-EBL  control  rules  are  faster  is  that 
the  best  control  rules  learned  by  AxA-EBL  outperform  the 
best  control  rules  learned  by  A-EBL,  requiring  1.5  seconds 
versus  12.9  seconds  to  solve  the  50  problems  in  the  test  set. 


knowledge  against  a  benchmark.  This  would  be  pro¬ 
hibitively  expensive  in  a  real-world  domain.  Ellman 
[Ellman,  1988],  Tadepalli  [Tadepalli,  1989]  and  Chien 
[Chien,  1989]  have  also  investigated  explanation  based 
learning  of  approximate  theories.  The  search  tech¬ 
niques  described  by  these  authors  improve  on  Keller’s 
by  taking  advantage  of  a  partial  order  on  approximate 
theories;  however,  they  have  not  been  applied  to  the 
problem  of  learning  control  rules. 

Approximations  to  EBL  rules  have  also  been  investi¬ 
gated in [Chase et al., 1989] in the context of the ULS
system.  The  approximations  done  in  ULS  are  not, 
however,  selected  by  an  inductive  learning  mechanism; 
as a consequence, ULS is limited to conservative approximations
(e.g., dropping one or two conditions).
ULS nonetheless has made modest improvements in
planning  time  in  the  RW  domain. 

The use of k-bounded approximations is one way
of  bounding  the  cost  of  learned  rules.  An  alternative 
approach  is  described  by  Tambe  and  Rosenbloom  in 
[Tambe  and  Rosenbloom,  1989];  they  describe  a  syn¬ 
tactic  constraint  on  SOAR  programs  which  ensures 
that  chunks  have  match  cost  linear  in  their  length. 
A  corresponding  constraint  can  be  imposed  on  EBL 
rules;  it  is  easy  to  see  that  requiring  every  operational 
predicate  to  have  at  most  one  solution  will  lead  to 
EBL  rules  which  can  be  matched  in  linear  time.  An 
advantage  to  linear-time  restricted  EBL  is  that  special- 
purpose  inductive  techniques  are  not  needed  to  search 
for  correct  rules;  the  standard  incremental  learning  al¬ 
gorithm  can  be  used. 

In  order  to  satisfy  the  constraint  that  every  opera¬ 
tional  predicate  has  at  most  one  solution,  it  is  typically 
necessary  to  change  the  operationality  predicate.  In 
the  planning  domains,  for  instance,  it  is  sufficient  to 
modify  the  operationality  predicate  so  that  the  holds 
predicate,  which  determines  if  a  condition  is  true  in 
a  state,  is  marked  as  non-operational.  Using  the  new 
operationality  predicate,  control  rules  learned  by  EBL 
are expressed in terms of constraints on elements of
the  list  structure  used  to  represent  a  state,  rather 
than  constraints  on  the  conditions  that  are  true  or 
false  in  that  state.  Changing  the  operationality  pred¬ 
icate  in  this  way  closely  corresponds  to  Tambe  and 
Rosenbloom’s  suggestion  of  replacing  multi-attributes 
in  SOAR  programs  with  list-processing  utilities. 

Figure 3 shows the result of using linear-time restricted
EBL (LR-EBL) in the BW and RW domains.
This experiment points out a disadvantage of the technique:
learned rules tend to be very specific, and hence
do not often transfer to new situations. As a consequence,
there is little or no performance improvement
on a test set of novel problems, and so in both domains
performance improvement was far less than that attained
by AxA-EBL. In the RW domain, a modest
15-20% improvement was achieved, whereas AxA-EBL
improved performance by over 80%; in the BW
domain, performance was degraded slightly, whereas
AxA-EBL improved performance by 65%.*

Figure 3: Performance and Coverage Curves for BW and RW with Linear-Time Rules


6  Conclusions 


The  utility  of  a  rule  is  its  contribution  to  performance 
improvement;  utility  is  directly  proportional  to  the 
coverage  of  a  rule  and  inversely  proportional  to  the 
match  cost.  This  paper  introduces  two  new  techniques 
for  improving  the  utility  of  learned  rules.  The  first 
technique  is  to  combine  EBL  with  inductive  learn¬ 
ing  techniques,  thus  increasing  the  coverage  of  control 
rules; the second technique is to constrain these inductive
techniques to learn approximate control rules that
have very low match cost. These two techniques have
been synthesized in the AxA-EBL algorithm. AxA-EBL
has been experimentally shown to produce control
rules which improve substantially over the control
rules learned by standard EBL in several domains; the
average improvement is about a factor of two over the
domains that have been studied.

*The training and test problems used for these experiments
and the experiments of Figure 2 are the same; the
CPU time needed to solve the test set without learning
differs because with a different operationality predicate,
different domain-specific predicates can be compiled.

Several open problems remain. First, the techniques
used  are  non-incremental;  it  would  be  desirable  if 
learning  could  take  place  concurrently  with  problem¬ 
solving,  rather  than  as  a  separate  pass.  Second,  the 
current  implementation  of  AxA-EBL  must,  for  reasons 
of  computational  complexity,  use  a  very  small  and  con¬ 
straining  set  of  approximations;  it  would  be  desirable 
to  use  a  larger  set  of  approximations.  Third,  one  cost 
of  learning  approximate  control  rules  seems  to  be  a 
slower  convergence  rate,  relative  to  simpler  learning 
systems.  It  would  be  desirable  if  this  tradeoff  could  be 
lessened  or  avoided. 

Abstract

In  many  domains  it  is  computationally  in¬ 
tractable  to  apply  EBL  to  learn  knowledge 
that  is  operational,  complete  and  correct. 
Current  approaches  to  solving  this  prob¬ 
lem  exploit  the  trade-off  between  correct¬ 
ness  and  operationality,  and  introduce  ap¬ 
proximations  to  learn  operational  but  in¬ 
correct  knowledge.  In  this  paper  we  intro¬ 
duce  an  alternative  approach  that  exploits 
the  trade-off  between  completeness  and  op¬ 
erationality  to  learn  knowledge  that  is  both 
operational  and  correct,  but  incomplete, 
since  it  only  covers  a  subset  of  the  origi¬ 
nal  domain.  We  employ  a  “boot-strapping” 
approach  to  incrementally  increase  cover¬ 
age  of  the  domain  by  using  the  knowl¬ 
edge  learned  from  small  problems  to  sim¬ 
plify  learning  from  more  complex  problems. 
Correct knowledge is learned by employing
a second order theory of goal achievement
to construct abstract proofs that are
useful  for  learning.  These  proofs  are  com¬ 
piled  into  operational  form  by  employing 
simplifying  assumptions.  The  advantage 
of  this  approach  is  that  learned  knowledge 
is  guaranteed  to  be  correct  if  the  simplify¬ 
ing  assumptions  made  during  learning  hold 
during  problem  solving.  We  illustrate  the 
method  in  two-person  games  and  present 
preliminary  results  in  Quinlan’s  lost-in-n- 
ply  classification  problem  (1983). 

1  Introduction 

In  optimization  and  2  person  game  domains  it  is 
computationally  intractable  to  apply  EBL  to  learn 
knowledge  that  is  operational,  correct  and  complete. 

*I  am  grateful  to  my  advisor,  Tom  Dietterich,  for  his 
advice  and  encouragement,  and  to  Prasad  Tadepalli  for 
many  interesting  discussions  and  useful  comments  on  a 
draft  of  this  paper. 


This is an instance of the intractable theory problem
(Mitchell, Keller, & Kedar-Cabelli 1986; Tadepalli,
1989).  The  problem  arises  in  these  domains  because 
of  the  resulting  complexity  of  two  steps  in  the  EBL 
method:  (a)  constructing  a  complete  proof  and  (b) 
extracting  a  correct  and  operational  sufficient  con¬ 
dition  from  the  proof. 

•  Generating  the  proof  is  intractable.  In  opti¬ 
mization  problems,  to  prove  that  the  best  possi¬ 
ble  goal  has  been  achieved  requires  demonstrat¬ 
ing  that  all  other  options  lead  to  a  result  of 
lower  value.  In  games,  in  addition  to  optimiz¬ 
ing  the  goal  achieved,  we  must  also  demonstrate 
that  the  goal  will  be  achieved  under  all  possible 
defensive  actions  by  the  opponent.  Hence,  in 
the  worst  case,  proving  that  a  goal  is  achieved 
requires  exploring  an  exponential  number  of  ac¬ 
tions  (in  the  depth  of  the  proof). 

• Analyzing the explanation is intractable. The
goal  of  this  analysis  step  is  to  extract  a  suffi¬ 
cient  condition  from  the  proof  that  is  both  op¬ 
erational  and  correct.  This  is  difficult  in  the 
domains  of  interest  because  there  is  a  strong 
tradeoff  between  correctness  and  operationality. 
By  operational,  we  mean  that  the  sufficient  con¬ 
dition  must  be  directly  evaluable  in  the  current 
situation.  Hence  the  operator  applications  in 
the  proof  must  be  excluded  from  the  sufficient 
condition. When the proof only includes existentially
quantified operators this is straightforward
(Hirsh, 1987). However, the proofs we
are considering include universal quantification
over operators. There is no simple solution to
this problem (such as treating the ∀ as an “and”
node)  since  the  quantification  is  over  all  possi¬ 
ble  operators  not  just  those  that  occurred  in  the 
particular  example. 

The  principal  approach  to  solving  this  problem 
has  been  to  exploit  the  trade-off  between  correct¬ 
ness  and  efficiency  and  introduce  approximations 
to  learn  incorrect  but  efficient  knowledge  (Ellman, 
1988;  Chien,  1989;  Tadepalli,  1989).  In  Tadepalli 
(1989), approximations are introduced by analyzing
only incomplete proofs, while in Ellman (1988),
approximations  are  introduced  by  simplifying  the 
initial  domain  theory.  These  systems  have  demon¬ 
strated  the  usefulness  of  the  approach  in  many  do¬ 
mains  including  planning,  chess  end  games,  schedul¬ 
ing  and  hearts.  However,  one  problem  with  these 
approaches  is  that  the  knowledge  learned  is  incorrect 
and  will  produce  errors  when  applied  in  the  perfor¬ 
mance  task.  Moreover,  because  of  the  way  the  errors 
are  introduced,  the  system  does  not  know  the  condi¬ 
tions  under  which  the  compiled  knowledge  is  correct 
and  is  therefore  unable  to  determine  if  an  answer 
provided  is  correct. 

In  this  paper  we  present  an  alternative  approach 
to learning in intractable domains. Rather than
introduce  approximations,  we  apply  abstractions 
and  simplifications  to  learn  correct  but  incomplete 
knowledge.  The  knowledge  learned  is  incomplete 
because  it  only  applies  to  a  subset  of  the  complete 
domain.  However,  the  knowledge  is  guaranteed  to 
be  correct  for  that  sub-domain  if  the  simplifying  as¬ 
sumptions  made  during  learning  hold  for  that  sub- 
domain.  Examples  of  simplifying  assumptions  in¬ 
clude  assuming  that  only  a  limited  number  of  do¬ 
main  objects  are  in  the  problem  situation. 

The  idea  is  to  progressively  cover  the  domain  by 
learning  correct  knowledge  for  the  simple  cases  first, 
then  employing  this  knowledge  to  simplify  the  learn¬ 
ing  task  for  more  complicated  cases.  This  “boot¬ 
strapping”  approach  relies  on  a  transfer  of  learned 
knowledge  from  simple  cases  to  complex  cases.  The 
performance  task  will  continue  to  make  errors  when¬ 
ever  the  computational  resources  available  are  insuf¬ 
ficient  to  solve  a  given  problem.  However,  as  more 
of  the  problem  instances  are  solved  using  the  more 
efficient  learned  knowledge,  the  overall  number  of 
errors  made  by  the  system  should  decline. 

We  illustrate  our  approach  applied  to  two  person 
games.  In  particular,  we  present  preliminary  results 
applying  the  approach  to  a  classification  problem  in 
a  sub-domain  of  chess:  Quinlan’s  lost-in-n-ply  prob¬ 
lem  (Quinlan,  1983). 

The  remainder  of  this  paper  is  structured  as  fol¬ 
lows:  First,  we  detail  our  approach  in  a  domain  in¬ 
dependent  way.  Second,  we  define  Quinlan’s  lost- 
in-n-ply  problem.  Third,  we  describe  the  method 
in  detail  applied  to  learning  in  lost-in-2-ply.  Fourth, 
we  present  preliminary  empirical  results  for  this  sub- 
domain.  Fifth,  we  discuss  related  work.  Finally,  we 
conclude  with  a  discussion  of  the  current  limitations 
of  the  approach  and  future  work  aimed  at  overcom¬ 
ing  those  limitations. 

2  Approach 

This  discussion  of  our  approach  emphasizes  the  sec¬ 
ond  of  the  two  problems  with  intractable  domains: 
extracting  correct  and  operational  sufficient  condi¬ 


tions  from  proofs.  The  first  problem — constructing 
the  proof — is  discussed  later. 

Before  we  describe  our  approach  to  solving  this 
problem,  it  is  instructive  to  consider  why  it  is  diffi¬ 
cult  to  extract  an  operational  and  correct  sufficient 
condition  from  the  proofs  constructed  by  min-max 
(or α-β) search. Consider, for example, the following
min-max proof that the maximizing player, denoted
+, has achieved an advantageous goal G in S in 2
ply:

(0) achieve(G, S, -, 2) ⇐
      ∀ Op-, ∃ Op+, G(do(Op+, do(Op-, S)))

This rule can be read as “The maximizing player
can achieve goal G in S in two ply because for all
the minimizing player’s moves (denoted Op-) there
exists  a  move  for  the  maximizing  player  (denoted 
Op+)  that  results  in  G  being  true.”  Note  that  we 
use  a  situation  calculus  encoding  for  operators  where 
S  denotes  the  initial  situation  and  do(Op,S)  denotes 
the  situation  following  the  application  of  the  opera¬ 
tor  Op  in  S. 
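
Rule (0) can be made concrete with a small sketch. The toy domain is an assumption made only for illustration and is not from the paper: situations are integers, the minimizing player subtracts 1 or 2, the maximizing player adds 1, 2 or 3, and the goal G holds when the value is at least 5. The function names are hypothetical.

```python
from typing import Callable, Iterable


def achieve_in_2_ply(
    goal: Callable[[int], bool],
    state: int,
    minus_ops: Callable[[int], Iterable[int]],
    plus_ops: Callable[[int], Iterable[int]],
    do: Callable[[int, int], int],
) -> bool:
    """Rule (0): for every opponent move there exists a reply that reaches G."""
    for op_minus in minus_ops(state):
        after_minus = do(op_minus, state)
        if not any(goal(do(op_plus, after_minus)) for op_plus in plus_ops(after_minus)):
            return False
    return True


# Toy domain (illustrative assumption): do(Op, S) simply adds the move to S.
def do(op: int, s: int) -> int:
    return s + op


def minus_ops(s: int) -> Iterable[int]:
    return [-1, -2]


def plus_ops(s: int) -> Iterable[int]:
    return [1, 2, 3]


def goal(s: int) -> bool:
    return s >= 5


print(achieve_in_2_ply(goal, 5, minus_ops, plus_ops, do))
print(achieve_in_2_ply(goal, 3, minus_ops, plus_ops, do))
```

Note that the test inspects the future situations do(Op+, do(Op-, S)) directly, which is exactly what makes the rule non-operational in the sense discussed next.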

In  order  for  the  sufficient  condition  of  this  proof 
to  be  operational  it  must  only  test  properties  of 
the  initial  situation  S.  Therefore,  since  this  proof 
currently  tests  properties  of  a  future  situation  (in 
G(do(Op+, do(Op-, S)))), we must replace this form
with some equivalent (or sufficient) condition that
only tests properties of S. If both the operators
were existentially quantified, this could be achieved
within the situation calculus framework by simply
unfolding G(do(Op+, do(Op-, S))) until it terminated
in literals that were true in S. The sufficient
condition would then be the leaves of this proof
(Hirsh  1987).  However,  this  technique  only  applies 
when  we  have  existential  quantification,  and  if  we 
apply  it  in  (0)  we  will  produce  a  sufficient  condition 
that  is  over-general  and  therefore  incorrect. 

We  present  a  new  approach  based  on  constructing 
alternative  proofs  that  are  more  useful  for  learning. 
We  construct  these  proofs  by  employing  an  abstract 
second-order  theory  that  defines  how  goals  can  be 
achieved  in  terms  of  influence  relations  among  op¬ 
erators,  goals  and  situations.  There  are  3  primitive 
influence  relations: 

(1) η(λs.G(s), Op, S) ⇔ ¬G(S) ∧ G(do(Op, S))

(2) δ(λs.G(s), Op, S) ⇔ G(S) ∧ ¬G(do(Op, S))

(3) ρ(λs.G(s), Op, S) ⇔ G(S) ∧ G(do(Op, S))

where η(G, Op, S) can be read as “if Op is applied in
S then goal G will be made true,” δ(G, Op, S) can
be read as “if Op is applied in S then goal G will
be made false,” while ρ(G, Op, S) can be read as “if
Op is applied in S then goal G will be maintained
true.” Note the use of lambda binding for situations
in the goals. This notation is employed because the
relations are second order: they take a goal as an
argument and evaluate it in two different situations,
in the initial situation S and in the situation following
the operator application do(Op, S). We call
these primitives influence relations because they define
the ways in which the application of operators
affects the truth of goals. Note that the goals need
not be simple literals in the domain theory like in
the STRIPS encoding of domains, but may be arbitrarily
complex expressions.
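
The three influence relations translate almost directly into code once a goal is passed as a function of the situation, which plays the role of the lambda binding. The following is a minimal sketch under stated assumptions: situations are sets of facts and an operator is an (add list, delete list) pair; this encoding and the example facts are illustrative and are not the paper's representation.

```python
from typing import Callable, FrozenSet, Tuple

State = FrozenSet[str]
Op = Tuple[FrozenSet[str], FrozenSet[str]]  # (add list, delete list), assumed encoding
Goal = Callable[[State], bool]              # stands in for the lambda-bound goal


def do(op: Op, s: State) -> State:
    """Situation that results from applying op in s."""
    add, delete = op
    return (s - delete) | add


def eta(goal: Goal, op: Op, s: State) -> bool:
    """(1) op makes the goal true: false in s, true in do(op, s)."""
    return (not goal(s)) and goal(do(op, s))


def delta(goal: Goal, op: Op, s: State) -> bool:
    """(2) op makes the goal false: true in s, false in do(op, s)."""
    return goal(s) and not goal(do(op, s))


def rho(goal: Goal, op: Op, s: State) -> bool:
    """(3) op maintains the goal: true in s and in do(op, s)."""
    return goal(s) and goal(do(op, s))


# The goal need not be a single literal; here it is a conjunction of two facts.
goal: Goal = lambda s: "a" in s and "b" in s
add_b: Op = (frozenset({"b"}), frozenset())
del_a: Op = (frozenset(), frozenset({"a"}))
s0: State = frozenset({"a"})
s1: State = do(add_b, s0)

print(eta(goal, add_b, s0), delta(goal, del_a, s1), rho(goal, add_b, s1))
```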

With  these  primitives,  we  can  construct  axioms 
that  define  how  goals  can  be  achieved  through  the 
application  of  operators.  For  example,  the  axiom 
below  states  the  conditions  under  which  a  goal  G  is 
achieved when it is +’s turn to play in situation S:

(4) achieve(λs.G(s), S, +, 1) ⇔
      G(S) ∧ ∃Op+, ¬δ(λs.G(s), Op+, S)
      ∨ ¬G(S) ∧ ∃Op+, η(λs.G(s), Op+, S)

There  are  two  cases:  either  G  is  already  true  and 
the  operator  does  not  affect  it,  or  G  is  false  and  the 
operator  makes  the  goal  true.  This  rule  can  be  used 
in  counter  planning  situations  to  prove  that  G  has 
been  achieved  when  G  is  an  advantageous  goal  for 
+. However, for complete counter planning we need
to  define  the  other  case — when  an  advantageous  goal 
is  achieved  following  the  opponent’s  (denoted  -)  turn 
to  play: 

(5) achieve(λs.G(s), S, -, 1) ⇔
      G(S) ∧ ∀Op-, ¬δ(λs.G(s), Op-, S)
      ∨ ¬G(S) ∧ ∀Op-, η(λs.G(s), Op-, S)

This  axiom  is  basically  the  same  as  (4)  but  includes 
universal  quantification  over  the  operators.  These 
axioms  can  in  turn  be  used  to  produce  proofs  of 
goal achievement for any depth of ply by composing
the two axioms. For example, the composition for
achieving G in 2 ply when the opponent is first to
move is:

achieve(λs.G(s), S, -, 2) ⇔
  achieve(λs1.achieve(λs2.G(s2), s1, +, 1), S, -, 1)
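
The composition of axioms (4) and (5) can be sketched as a recursive function in which the inner achieve becomes the goal of the outer one. This is an illustrative reading, not the paper's implementation: make_achieve, ops and do are hypothetical names, and the toy integer game (opponent subtracts 1 or 2, maximizer adds 1, 2 or 3, goal is a value of at least 5) is an assumption chosen so that the result can be checked by hand.

```python
from typing import Callable, Iterable


def make_achieve(ops: Callable[[str, int], Iterable[int]],
                 do: Callable[[int, int], int]):
    """Build achieve(goal, s, side, ply) from axioms (4) and (5)."""

    def eta(goal, op, s):
        return (not goal(s)) and goal(do(op, s))

    def delta(goal, op, s):
        return goal(s) and not goal(do(op, s))

    def achieve(goal, s, side, ply):
        if ply > 1:
            # Composition: the goal of the first mover's 1-ply axiom is that
            # the other side achieves the original goal in the remaining ply.
            other = "+" if side == "-" else "-"
            inner = lambda s1: achieve(goal, s1, other, ply - 1)
            return achieve(inner, s, side, 1)
        # Axiom (4) quantifies existentially (side +), axiom (5) universally.
        quantifier = any if side == "+" else all
        if goal(s):
            return quantifier(not delta(goal, op, s) for op in ops(side, s))
        return quantifier(eta(goal, op, s) for op in ops(side, s))

    return achieve


# Toy game, assumed for illustration only.
def ops(side: str, s: int) -> Iterable[int]:
    return [1, 2, 3] if side == "+" else [-1, -2]


def do(op: int, s: int) -> int:
    return s + op


goal = lambda s: s >= 5
achieve = make_achieve(ops, do)
print(achieve(goal, 5, "-", 2), achieve(goal, 3, "-", 2))
```

On this toy game the composed definition gives the same answers as testing ∀Op-, ∃Op+, G(do(Op+, do(Op-, S))) directly, while only ever being phrased in terms of the influence relations.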

We  can  produce  proofs  of  goal  achievement  written 
in  terms  of  the  influence  relations  by  unfolding  the 
different  cases  of  axioms  (4)  and  (5).  For  example, 
the  rule  below  is  constructed  by  selecting  the  second 
case  of  (4)  and  combining  it  with  the  first  case  of 
(5): 

achieve(λs.G(s), S, -, 2) ⇐
  ∃Op+, η(λs.G(s), Op+, S),
  ∧ ¬∃Op-, δ(λs1.∃Op+, η(λs2.G(s2), Op+, s1), Op-, S)

This rule can be read as: “+ achieves goal G in S in
2 ply because + is threatening to achieve the goal
in 1 ply and the opponent cannot interfere.” The
first conjunct in the rule describes the threat of +
achieving G, while the second conjunct describes -’s
inability to prevent + from achieving G.

Other,  more  complicated  rules  can  likewise  be  con¬ 
structed  for  deeper  ply,  or  for  more  than  one  goal. 
For  example,  the  following  rule  proves  that  one  of 
two  maximizing  goals  is  achieved  in  2  ply: 

achieve(λs.[G1(s) ∨ G2(s)], S, -, 2) ⇐
  ∃Op+, η(λs.G1(s), Op+, S),
  ∧ ∃Op+, η(λs.G2(s), Op+, S),
  ∧ ¬∃Op-, δ(λs1.∃Op+, η(λs2.G1(s2), Op+, s1), Op-, S)
          ∧ δ(λs1.∃Op+, η(λs2.G2(s2), Op+, s1), Op-, S)

This rule is the familiar notion of a fork: there
is a double threat (first two conjuncts) and the opponent
cannot simultaneously prevent both threats (last
conjunct). In general, these rules provide a space of
possible proofs for goal achievement in counter planning
situations.
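
The fork rule can be exercised on a deliberately tiny example. Everything about the domain below is an assumption made for illustration (two open "lines", a take move per line for +, and blocking moves for -); only the shape of the rule follows the text: two threats, and no single opponent move for which δ holds of both threats.

```python
from typing import Callable, FrozenSet, Iterable

State = FrozenSet[str]
Goal = Callable[[State], bool]


def do(op: str, s: State) -> State:
    """Hypothetical toy moves: + takes an open line, - blocks lines."""
    if op == "take1":
        return s | {"won1"} if "open1" in s else s
    if op == "take2":
        return s | {"won2"} if "open2" in s else s
    if op == "block1":
        return s - {"open1"}
    if op == "block2":
        return s - {"open2"}
    if op == "block_both":
        return s - {"open1", "open2"}
    raise ValueError(op)


def eta(goal: Goal, op: str, s: State) -> bool:
    return (not goal(s)) and goal(do(op, s))


def delta(goal: Goal, op: str, s: State) -> bool:
    return goal(s) and not goal(do(op, s))


def threat(goal: Goal, plus_ops: Iterable[str]) -> Goal:
    """The nested goal: some + move makes the goal true in situation s1."""
    plus_ops = list(plus_ops)
    return lambda s1: any(eta(goal, op, s1) for op in plus_ops)


def fork(g1: Goal, g2: Goal, s: State,
         plus_ops: Iterable[str], minus_ops: Iterable[str]) -> bool:
    t1, t2 = threat(g1, plus_ops), threat(g2, plus_ops)
    blocks_both = any(delta(t1, op, s) and delta(t2, op, s) for op in minus_ops)
    return t1(s) and t2(s) and not blocks_both


g1: Goal = lambda s: "won1" in s
g2: Goal = lambda s: "won2" in s
s0: State = frozenset({"open1", "open2"})
plus = ["take1", "take2"]

print(fork(g1, g2, s0, plus, ["block1", "block2"]))
print(fork(g1, g2, s0, plus, ["block1", "block2", "block_both"]))
```

With only single blocks available the position is a fork; adding a move that removes both threats at once defeats it. Only operators that affect one of the two threats matter to the outcome, which is the constraint on relevant operators that the compilation step relies on.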

The  important  characteristic  of  these  proofs  which 
makes  them  suitable  for  learning  is  that  they  don’t 
explicitly  test  properties  of  future  situations.  The 
only  situation  included  in  any  of  the  literals  in  the 
proof is S, the initial situation. Hence, we can extract
a sufficient condition from these proofs that
only  tests  properties  of  the  initial  situation.  How¬ 
ever,  we  still  do  not  have  an  “operational”  suf¬ 
ficient  condition  because  during  match,  we  must 
evaluate  nested  and  universally  quantified  influence 
relations — a  computationally  expensive  task.  In¬ 
deed,  it  appears  that  we  have  gained  little  by  con¬ 
structing  the  alternative  proof.  However,  the  alter¬ 
native  proofs  differ  from  the  min-max  proofs  in  that 
the  operators  are  not  completely  unconstrained — we 
need  only  consider  relevant  operators  that  affect  the 
goals  according  to  the  influence  relations  included 
in  the  proof.  In  compiling  the  alternative  proof  for 
fork  above  we  need  only  consider  operators  of  -  that 
prevent + from achieving each of the two goals.

We  introduce  a  new  approach  to  compiling  these 
sufficient  conditions  that  produces  a  simple  pattern 
of  features  that  can  be  efficiently  tested  in  the  ini¬ 
tial  situation.  The  approach  exploits  the  simplifying 
assumptions  mentioned  earlier  to  reduce  the  com¬ 
plexity  of  the  computation.  These  simplifications 
are constraints on the situation S such as restrictions
on the number, geometrical arrangement, or
properties  of  objects  in  S. 

This  concludes  our  description  of  the  general  ap¬ 
proach.  We  now  describe  an  application  domain 
(Quinlan’s  lost-in-n-ply)  and  illustrate  the  method 
in  detail. 

3  Domain:  Quinlan’s  lost-in-n-ply 

Quinlan’s  lost-in-n-ply  domain  is  a  sub-domain  of 
chess  with  only  4  pieces:  knight  and  king  against 
rook  and  king.  The  performance  task  is  one  of 
classification — given  a  position,  determine  if  it  is 
lost-in-n-ply  for  the  knight  side,  where  a  loss  is 
defined as the capture of the knight (without recapture
of the rook) or check-mate. Although this
domain  is  much  simpler  than  full  chess,  it  can 
present  quite  a  challenge  even  to  the  master-rated 
player  (Kopec  &  Niblett,  1980).  This  is  a  large  do¬ 
main  that  includes  over  11  million  legal  knight-side- 
to-move  positions  and  9  million  rook-side-to-move 
positions.  In  general,  about  half  of  the  rook-side-to- 
move  positions  and  one  fifth  of  the  knight-side-to- 


280  Flann 


Figure  1:  Three  problem  instances  from  lost-in-2-ply,  black  to  play,  white  to  win 

legal-move(S, do(Op,S), Side) ⇐
  pseudo-move(Op, S, Side),
  ∧ ¬in-check(do(Op,S), Side)

pseudo-move(op(FSq, TSq, [Type,Side], empty), S, Side) ⇐
  on(S, FSq, [Type,Side]),
  ∧ legal-direction(Type, Dir),
  ∧ reachable(S, FSq, TSq, Type, Dir),
  ∧ on(S, TSq, empty)

lost(S, Side) ⇐
  kn-l(S, Side)
  ∨ check-mate(S, Side)

check-mate(S, Side) ⇐
  in-check(S, Side),
  ∧ ∀ Op, pseudo-move(Op, S, Side) ⇒
      in-check(do(Op,S), Side)


Table  1:  Selected  axioms  from  the  chess  domain  theory 


move  positions  are  losses  for  the  knight  side.  How¬ 
ever,  many  of  these  losing  positions  are  lost  within 
small search horizons (i.e., small values of n). Below
we give a breakdown of the percentage of positions
that are lost-in-n-ply for small values of n:

Search depth n   Total count of lost positions   Percentage of total lost positions
1                3.03 × 10^6                     58.0 %
2                0.76 × 10^6                     15.0 %
3                0.37 × 10^6                      7.0 %

We illustrate selected lost-in-2-ply positions in
Figure 1 and selected axioms from the
chess domain theory in Table 1.

4 Method

We  apply  our  learning  approach  to  this  problem  by 
learning  the  simple  cases  (small  values  of  n)  first  and 
then  learning  the  more  complicated  cases.  The  sim¬ 
plifying  assumption  we  employ  is  that  all  situations 
contain  only  the  four  playing  pieces. 
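
To see why learning the small values of n first pays off, the percentages reported for n = 1, 2, 3 in the breakdown of lost positions can simply be accumulated. The snippet below is only arithmetic on those reported figures, not an additional result: it gives the share of all lost positions that would be covered if every depth up to n were fully learned.

```python
from itertools import accumulate

# Percentage of all lost positions that are lost in exactly n ply,
# as reported in the breakdown for small values of n.
percent_lost_by_depth = {1: 58.0, 2: 15.0, 3: 7.0}

cumulative_coverage = dict(
    zip(percent_lost_by_depth, accumulate(percent_lost_by_depth.values()))
)

for n, covered in cumulative_coverage.items():
    print(f"depths <= {n}: {covered:.1f} % of lost positions")
```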

In  the  remainder  of  the  paper  we  illustrate  the 
method  solving  the  following  problem: 

Given: 

♦  A  simple  declarative  encoding  of  chess  legal 
moves  and  goals. 

♦  Randomly  drawn  positive  examples  of  lost-in¬ 
n-ply  for  some  n. 

Find: 

♦  A  small  decision  tree  that  efficiently  and  cor¬ 
rectly  classifies  lost-in-n-ply  positions. 


The  learning  method  is  applied  whenever  the  current 
decision tree fails to classify a given problem instance.
The  method  has  three  stages:  (1)  produce  a  proof  of 
goal  achievement,  (2)  produce  an  operational  suffi¬ 
cient  condition  from  the  proof  and  (3)  integrate  this 
sufficient  condition  into  the  decision  tree.  We  il¬ 
lustrate  the  method  learning  from  problem  instance 
stated  in  Figure  1. 

4.1  Produce  a  proof  of  goal  achievement 

The goal of this stage is to produce a proof of lost-in-2-ply
using the abstract theory and the domain
theory. The current implementation constructs
this proof by first constructing a min-max
search tree then reconstructing a proof of lost-in-2-ply
using axioms from the abstract theory. This reconstruction
process is constrained by the min-max
search tree in a manner similar to Minton’s EBS process
(Minton, 1988). In the final section of the paper
we  discuss  an  alternative  approach  that  avoids  first 
constructing  the  min-max  search  tree.  The  proof 
produced  is  given  below: 
Figure 2: Compiling δ(λs.in-check(s), Op1, S)


achieve(λs.lost-in-2-ply(s), S, -, 2) ⇐
  in-check(S),
  ∧ ∃Op+, η(λs.kn-l(s), Op+, S),
  ∧ ¬∃Op-, δ(λs1.∃Op+, η(λs2.kn-l(s2), Op+, s1), Op-, S)
          ∧ δ(λs.in-check(s), Op-, S)

This  proof  can  be  read  as:  “Black  is  lost  in  2  ply 
because  the  black  king  is  in  check,  white  can  cap¬ 
ture  black’s  knight  and  both  black’s  goals  of  pre¬ 
venting  the  check  threat  and  preventing  the  knight 
from  being  captured  cannot  be  achieved.”  The  in¬ 
check  constraints  arise  from  applying  the  following 
additional  axiom  from  the  abstract  theory  to  the 
¬in-check(do(Op,S), black) constraint in the definition
of legal-move in Table 1. The axiom states the
conditions under which a goal G is guaranteed to be
false following an operator (Op) application:

¬G(do(Op, S)) ⇔
  G(S) ∧ ∃Op, δ(λs.G(s), Op, S)
  ∨ ¬G(S) ∧ ∃Op, ¬η(λs.G(s), Op, S)

4.2  Produce  an  operational  sufficient 
condition 

The  goal  of  this  stage  is  to  find  a  sufficient  condi¬ 
tion  of  this  proof  that  is  efficient  to  evaluate.  The 
target  form  is  a  small  disjunction  of  conjunctions  of 
features,  where  a  feature  is  defined  as  a  conjunction 
of  “operational”  (i.e.,  directly  evaluable)  literals. 

The first two forms in the proof are easy to compile.
Each one is used to define a new feature by
extracting its sufficient condition from the example
situation. The ∃Op+, η(λs.kn-l(s), Op+, S) form results in
the white-rook-attacks-knight(S) feature defined*:

white-rook-attacks-knight(S) ⇐
  on(S, SqR, [rook,white]),
  ∧ on(S, SqN, [knight,black]),
  ∧ openline(S, SqR, SqN, Dir),
  ∧ legal-direction(rook, Dir)

*The system simply generates “gensym” names for
the features; I have used intuitive names to assist
understanding.

This feature is true in a situation S when the white
rook is on SqR, the black knight is on square SqN
and there exists a line of empty squares between SqR
and SqN along direction Dir such that Dir is a legal
direction for the rook.
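
A feature of this kind is cheap to evaluate because it only inspects the current position. The sketch below is one possible reading, with assumptions stated plainly: the board is a dictionary from (file, rank) pairs in 0..7 to (type, side) pairs, empty squares are simply absent, and openline is implemented as a walk along one rook direction. None of these representation choices come from the paper.

```python
from typing import Dict, Optional, Tuple

Square = Tuple[int, int]
Piece = Tuple[str, str]          # (type, side)
Board = Dict[Square, Piece]      # assumed encoding; empty squares are absent

ROOK_DIRECTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # legal-direction(rook, Dir)


def find(board: Board, piece: Piece) -> Optional[Square]:
    """Square holding the given piece, or None if it is not on the board."""
    return next((sq for sq, p in board.items() if p == piece), None)


def openline(board: Board, sq_from: Square, sq_to: Square, direction: Square) -> bool:
    """True if sq_to is reached from sq_from along direction over empty squares."""
    dx, dy = direction
    x, y = sq_from[0] + dx, sq_from[1] + dy
    while 0 <= x < 8 and 0 <= y < 8:
        if (x, y) == sq_to:
            return True
        if (x, y) in board:      # an intervening piece blocks the line
            return False
        x, y = x + dx, y + dy
    return False


def white_rook_attacks_knight(board: Board) -> bool:
    sq_r = find(board, ("rook", "white"))
    sq_n = find(board, ("knight", "black"))
    if sq_r is None or sq_n is None:
        return False
    return any(openline(board, sq_r, sq_n, d) for d in ROOK_DIRECTIONS)


open_file = {(0, 0): ("rook", "white"), (0, 5): ("knight", "black"),
             (4, 4): ("king", "white"), (7, 7): ("king", "black")}
blocked = {(0, 0): ("rook", "white"), (0, 5): ("knight", "black"),
           (0, 3): ("king", "white"), (7, 7): ("king", "black")}

print(white_rook_attacks_knight(open_file), white_rook_attacks_knight(blocked))
```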

This  results  in  the  following  sufficient  condition: 

achieve(λs.lost-in-2-ply(s), S, -, 2) ⇐
  white-rook-attacks-knight(S),
  ∧ white-rook-checks-king(S),
  ∧ ¬∃Op-, δ(λs1.∃Op+, η(λs2.kn-l(s2), Op+, s1), Op-, S)
          ∧ δ(λs.in-check(s), Op-, S)

The more challenging task for the compiler is to compile
the influence constraint over Op- into features.
The system first re-expresses the form as universals:

∀Op1, ∀Op2, δ(λs1.∃Op+, η(λs2.kn-l(s2), Op+, s1), Op1, S)
            ∧ δ(λs.in-check(s), Op2, S)
            ⇒ Op1 ≠ Op2


The universals reveal the problem with compiling
this form: we must consider all possible Op1’s
and Op2’s in the influence relations. To simplify this
process, we exploit all the constraints that apply to
S including contextual constraints and simplifying
assumptions. First, we know that both white-rook-attacks-knight(S)
and white-rook-checks-king(S) are
true in S. Second, we apply the simplifying assumption
and assume that the 4 pieces (black knight and
king, white rook and king) are all the pieces that can
ever occur in S.


Figure 3: Compiling ∀Op1, ∀Op2, δ(λs.G1(s), Op1, S) ∧ δ(λs.G2(s), Op2, S) ⇒ Op1 ≠ Op2

Given these constraints we first compile the individual
influence relations. In order to understand
how this can be achieved, it is useful to review the
definition of the influence primitives defined previously.
Given a situation S in which G is true,
δ(λs.G(s), Op, S) must generate operators that when
applied in S make G false. If we were working in a
domain where the goals are simple literals directly
modified by the operators (i.e., the goals are literals
in the add and delete lists), computing make-false
would be easy; we simply identify those operators
that have G both in the delete list and precondition.
However, this process is more complicated when G
is some derived property of the situation.
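
For the easy case just described, where goals are literals that appear directly in add and delete lists, make-false reduces to a filter over the operators. The sketch below covers only that STRIPS-style case; the Operator class and the blocks-world operators are hypothetical examples, and this is not the partial evaluation technique the paper uses for derived goals.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: FrozenSet[str]
    add_list: FrozenSet[str]
    delete_list: FrozenSet[str]


def make_false(goal: str, operators: List[Operator]) -> List[Operator]:
    """Operators that require the literal to hold and then delete it."""
    return [op for op in operators
            if goal in op.preconditions and goal in op.delete_list]


# Hypothetical blocks-world operators, used only to exercise the filter.
operators = [
    Operator("putdown(a)",
             frozenset({"holding(a)"}),
             frozenset({"ontable(a)", "handempty"}),
             frozenset({"holding(a)"})),
    Operator("pickup(b)",
             frozenset({"handempty", "ontable(b)"}),
             frozenset({"holding(b)"}),
             frozenset({"handempty", "ontable(b)"})),
    Operator("stack(a,b)",
             frozenset({"holding(a)", "clear(b)"}),
             frozenset({"on(a,b)", "handempty"}),
             frozenset({"holding(a)", "clear(b)"})),
]

print([op.name for op in make_false("holding(a)", operators)])
```

A goal such as in-check is not a literal in any delete list, so no filter of this kind applies to it.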

We employ an eager partial evaluation technique
that generates all possible operators in S symbolically
and determines those that result in G being
false. The result of this analysis for
δ(λs.in-check(s), Op1, S) is illustrated graphically in Figure 2.
Note that the technique produces 4 sets of operators:
(1a) where the black knight takes the rook,
(1b) where the black king moves out of check, (1c)
where the king takes the rook, and (1d) where the
knight blocks the check. Note that additional constraints
are introduced to ensure that the goal will
be false following the operator application. When
moving the king out of check, the direction must
not be along the line of the check, nor must either
the rook or the white king be in a position to check
the black king in its destination square.

Applying the technique to
δ(λs1.∃Op+, η(λs2.kn-l(s2), Op+, s1), Op2, S) similarly yields 3 sets of
operators: (2a) where the knight moves out of
the way, (2b) where the black king takes the rook
and (2c) where the knight takes the rook. Again,
additional  constraints  are  included  so  that  the  goal 
is  false  following  the  operator  application. 

The  final  stage  of  compiling  this  expression  into 
features is to eliminate the ∀Op1, ∀Op2 expression.
To  achieve  this  we  do  exhaustive  case  analysis  by 
“multiplying  out”  the  two  sets  of  operators.  We 


unify  each  operator  in  set  (1)  with  each  operator 
in  set  (2)  and  retain  those  that  are  consistent.  This 
process®  is  illustrated  in  Figure  3. 
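
The "multiplying out" step can be sketched as a cross product of two lists of symbolic operator patterns, keeping only the pairs that unify. The pattern encoding below is a strong simplification and an assumption of this sketch: each pattern is a dictionary of constraints, and the symbolic "to" labels (on-rook-line, off-rook-line) merely stand in for the additional geometric constraints attached to the operator sets; they are not the system's actual representation.

```python
from itertools import product
from typing import Dict, List, Optional

Pattern = Dict[str, str]


def unify(p: Pattern, q: Pattern) -> Optional[Pattern]:
    """Merge two constraint patterns, or return None if they conflict."""
    merged = dict(p)
    for key, value in q.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


def multiply_out(set1: List[Pattern], set2: List[Pattern]) -> List[Pattern]:
    """Exhaustive case analysis: keep every consistent pairing."""
    consistent = []
    for p, q in product(set1, set2):
        merged = unify(p, q)
        if merged is not None:
            consistent.append(merged)
    return consistent


# Operators that remove the check (sets 1a to 1d), in simplified form.
prevent_check = [
    {"piece": "knight", "captures": "rook"},                         # (1a)
    {"piece": "king", "captures": "empty"},                          # (1b)
    {"piece": "king", "captures": "rook"},                           # (1c)
    {"piece": "knight", "captures": "empty", "to": "on-rook-line"},  # (1d)
]
# Operators that remove the threat to the knight (sets 2a to 2c).
prevent_capture = [
    {"piece": "knight", "captures": "empty", "to": "off-rook-line"},  # (2a)
    {"piece": "king", "captures": "rook"},                            # (2b)
    {"piece": "knight", "captures": "rook"},                          # (2c)
]

for case in multiply_out(prevent_check, prevent_capture):
    print(case)
```

Under these simplified patterns two pairings survive, one where the knight takes the rook and one where the king takes the rook; the surviving patterns are what the next step turns into features and negates.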

To  generate  the  sufficient  condition  we  first  define 
features  from  the  resulting  patterns  (in  Figure  3), 
then  negate  the  features  (since  the  intersection  of 
the  operators  must  be  empty).  Simplifying  the  result 
produces  the  following  operational  sufficient  condi¬ 
tion: 

achieve{Xs.lost-in-2-ply{s),  S,  — ,2)  •<= 

[  white-rook-aitacks-knighi{S), 

A  whiie-rook-checks-king{S), 

A  black-king-attacks-rook{S) , 

A  white-king-proiecis-rook{S)] 

V[  white-rook-aUacks-knighi{S), 

A  white-rook- checks-kinglS), 

A  -I  black-king-aitacks-rook{S)] 

4.3  Update  the  classification  procedure 

The  final  stage  of  learning  is  to  update  the  current 
decision tree. This is achieved by employing an incremental
version of ID3 such as ID5 (Utgoff, 1988).
The  final  decision  tree  produced  for  lost-in-2-ply  is 
illustrated  in  Figure  4  (the  new  subtree  is  circled  on 
the  left). 

5 An evaluation in lost-in-n-ply

We consider three evaluation criteria, two that evaluate
the performance component and one that evaluates
the learning method. To evaluate the performance
component we consider (1) the relationship
between the problem solving time and the number
of training examples processed, and (2) the relationship
between the domain coverage of the decision
tree learned and the number of training examples
processed. To evaluate the learning method we consider
(3) the relationship between the time spent
learning and the proportion of the domain covered
by the decision tree.

®Note that this process produces two cases, one where
the knight takes the rook. In fact, this case is impossible
due to geometry and is eliminated by a later stage.

Figure 4: The updated decision tree

Figure 5: Evaluation in lost-in-n-ply. Graph (a) illustrates the cumulative problem solving time as a function
of the number of examples processed, for fixed n, (b) illustrates the coverage of the decision tree as a function
of the number of examples processed, for fixed n, and (c) illustrates the overall coverage as a function of the
maximum n (search depth) compiled.

1. Recognition time / Instances processed:
We performed an empirical study for lost-in-n-ply:
Let I_n be the set of all positive instances of
lost in exactly n ply. For each n we incrementally
learn a decision tree from randomly chosen
examples from I_n. Let T_n be a set of 100
randomly drawn instances from I_n. After each
learning event (i.e., an update of the decision
tree) we used the tree to classify those examples
in T_n. During experimentation we recorded
both the cumulative problem solving time and
the average time to solve the problems in T_n.
In Figure 5(a) we report the cumulative problem
solving time for the decision tree® for each
fixed n. The asymptotic results for T_n are given
below:


Search depth    Average problem solving time
     n          No Learn        Decision Tree
     1          OF"             5.0mS
     2          10.0S           30.0mS

The Decision Tree column for n = 2 gives the
results for the decision tree illustrated in
Figure 4.


2.  Coverage / Instances processed: We repeated
the above experiment, this time recording
the percentage of T_n that are classified by
the decision tree. These results are included in
Figure 5(b). In Figure 5(c) we give the overall
coverage for the complete domain under the
condition that for all i, i ≤ n, lost-in-i-ply has
achieved 100% coverage.

3.  Learning  time  /  Coverage:  In  the  current 
implementation,  to  learn  from  a  lost-in-n-ply 
example,  the  system  must  construct  a  min-max 
search  tree  of  depth  n.  Since  the  complexity 
of  constructing  this  proof  is  exponential  in  n, 
the  learning  time  grows  exponentially  as  more 
of  the  domain  is  covered  and  complete  coverage 
is  impossible.  We  will  return  to  this  issue  in  the 
final  discussion  section. 
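
The experimental protocol of items 1 and 2 can be sketched as follows. This is only a minimal illustration under stated assumptions: MemorizingClassifier is a hypothetical stand-in for the incrementally updated decision tree (it merely memorizes examples and does not generalize as the method described here does), and run_experiment, its parameters, and the integer-coded instances are names introduced for the example.

```python
import random
import time


class MemorizingClassifier:
    """Hypothetical stand-in for the incrementally updated decision tree."""

    def __init__(self):
        self.known = set()

    def update(self, example):
        # One "learning event": extend the classifier with a new example.
        self.known.add(example)

    def classify(self, instance):
        # True if the instance is covered, None if it cannot be classified.
        return True if instance in self.known else None


def run_experiment(instances, n_train, test_size=100, seed=0):
    """Record coverage of T_n and cumulative solving time after each update."""
    rng = random.Random(seed)
    test_set = rng.sample(instances, test_size)            # T_n
    training = [rng.choice(instances) for _ in range(n_train)]
    clf = MemorizingClassifier()
    coverage, cumulative_time = [], []
    elapsed = 0.0
    for example in training:
        clf.update(example)
        start = time.perf_counter()
        solved = sum(clf.classify(x) is not None for x in test_set)
        elapsed += time.perf_counter() - start
        coverage.append(100.0 * solved / len(test_set))    # percent of T_n
        cumulative_time.append(elapsed)
    return coverage, cumulative_time
```

Plotting coverage and cumulative_time against the number of examples processed gives curves of the same kind as Figure 5(a) and 5(b), although the values produced by this toy learner say nothing about the results reported above.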

6  Related  Work 

Minton  (1984)  introduced  an  explanation-based 
technique  for  learning  plans  in  games.  The  method 

®Due  to  the  recency  of  this  work,  only  the  n  =  1  and 
n  =  2  cases  have  been  completed. 


learns rules that recognize opportunities for achieving
goals via sequences of forced moves. This approach
was applied successfully to Go-moku and Tic-Tac-Toe.
However, it was not satisfactorily applied
to chess or other complex games. One reason for this
is that the system employed a very restricted
representation language for describing the forcing
conditions under which a goal is achieved. It was
difficult to represent conditions such as “for all empty
squares sq, between my king and the attacking
piece, there does not exist a piece that can move
into sq.” The method was further limited since it
was designed to learn from only one particular kind
of forced loss, the simple fork.

Tadepalli (1989) introduces an approach to learning
in chess end games based on learning optimistic
plans that are incorrect. These are then incrementally
refined upon failure. This approach and the
one described here can be viewed as opposite sides
of the lazy/eager trade-off. My approach eagerly
computes all relevant interactions at compile time,
while Tadepalli’s approach is lazy: it assumes there
are no interactions between plans. The principal
disadvantage of the lazy approach is that a burden
is placed on the human trainer to correct the errors
introduced by the approximations. My approach
prefers to expend CPU time over human trainer time.
However, there is a danger that learning will become
intractable. Ultimately, some mixed strategy may
be appropriate.

Quinlan (1983) applied the inductive learning
algorithm ID3 to learn a decision tree for small values
of n in lost-in-n-ply. This work demonstrated
that successful learning relies critically on the choice
of instance vocabulary. Quinlan spent a considerable
amount of time “hand engineering” sets of features
for lost-in-2-ply (2 man weeks) and lost-in-3-ply
(over 3 man months), and he gave up on lost-in-4-ply.
In contrast, the approach here needs no special
engineering; the chess domain theory provided is
very simple and succinct. Hence, this approach
reduces the need to perform “vocabulary engineering”
to achieve successful learning.

Braudaway and Tong (1989) describe a compilation
method that is similar to the one described here.
The most significant similarity is the use of abstract
case analysis to perform the compilation. In the
work described here, the expressions that describe
sets of operators (such as those operators that move
the king out of check) can be thought of as abstract
cases. Compilation consists of enumerating all possible
abstract cases within a sufficient condition and
determining the interactions among them. This process
is well illustrated in Figure 3. Braudaway and
Tong use a similar process to compile a declarative
specification of a legal floor plan into an efficient
generator of legal designs. Here, the abstract cases
are partial solution generators and compilation
involves simulating the generators to determine their
interactions. One principal difference between the
two works is that in Braudaway and Tong’s system,
the abstract cases arise from an explicit hierarchy of
structured objects (such as lines, corners, rectangles,
etc.), while in the work described here, the abstract
cases arise from partially evaluating the influence
relations.

7  Discussion 

The evaluation strongly suggests that the method
can be effective. However, there are two principal
drawbacks with the current implementation: (a)
the performance task is limited to classification,
whereas a more useful performance task would be to
apply the learned knowledge to actually solve problems
(i.e., play games), and (b) proof construction is still
intractable. In this section we briefly discuss a
solution to both these problems that is currently under
investigation.

Both these difficulties stem from the same problem:
using the abstract theory only to explain problem
solving and not to construct problem solving.
This approach is understandable initially, since the
abstract theory is extremely under-constrained when
used in a wholly backward or top-down manner.
However, once learning has compiled some of the
proofs to patterns, a more forward or bottom-up
approach may be appropriate. For example, state3 in
Figure 1 could be solved effectively bottom-up using
the fork rule (given earlier in the Approach section)
once white-rook-attacks-knight has been learned from
state1 and white-rook-threatens-checkmate has been
learned from state2.

By constructing the proof bottom-up we can play
games by selecting at each turn the operator that
leads to the best goal. The bottom-up approach can
also overcome the problem with intractable proof
construction, since constructing a proof for lost in
n + 1 ply can exploit transfer from the already
compiled proofs of lost in i ply, for 1 ≤ i ≤ n.
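
The idea of reusing already compiled shallower results when handling deeper ones can be illustrated on a much simpler game than chess. The sketch below is not the method of this paper: it is plain retrograde (dynamic programming) analysis of a hypothetical subtraction game, in which a position is a number of tokens, a move removes 1 to 3 tokens, and the side to move with no tokens left has lost. The function name and the game itself are assumptions made for the example.

```python
def solve_bottom_up(max_tokens, moves=(1, 2, 3)):
    """Label each position ("lost", n) or ("won", n) for the side to move.

    n is the number of ply until the game ends under best play. Each new
    position is labelled only from positions that are already solved.
    """
    result = {0: ("lost", 0)}
    for s in range(1, max_tokens + 1):
        successors = [result[s - m] for m in moves if m <= s]
        lost_depths = [d for status, d in successors if status == "lost"]
        if lost_depths:
            # Some move reaches a position already proven lost for the
            # opponent: reuse that result and take the quickest win.
            result[s] = ("won", 1 + min(lost_depths))
        else:
            # Every move reaches a position won for the opponent: the
            # position is lost, and the loss is delayed as long as possible.
            result[s] = ("lost", 1 + max(d for _, d in successors))
    return result


table = solve_bottom_up(8)
print(table[4], table[8])
```

In this toy game the positions with 4 and 8 tokens come out as lost in 2 and 4 ply respectively, and neither requires a fresh search: each is decided by looking up results for smaller positions.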

This paper has reported preliminary results for
this new approach. In addition to exploring the use
of the theory for constructing proofs and completing
the lost-in-n-ply problem, other work in progress
includes: (1) applying the technique to other
sub-domains of chess including some of the standard
endgame databases (Bratko & Michie, 1980), (2) applying
the technique to other game domains including
checkers, variants of chess and Go-moku, (3) applying
the technique to some non-game domains such
as scheduling, and (4) developing a theory of
the method that formalizes the expected behavior in
terms of characteristics of the application domain.

Abstract


A  weakness  of  EBL  is  its  inability  to  explain 
when  the  theory  is  incomplete.  This  paper 
presents  a  three-step  approach  to  deal  with 
incomplete  theories  based  on  abduction, 
analogical  reasoning  and  case-based  reasoning. 
Abduction  allows  us  to  explain  an  example  in 
the  context  of  an  incomplete  theory,  without 
modifying  the  theory.  The  simple  variant  of 
analogical  reasoning  used  here  not  only  provides 
an  explanation,  but  also  extends  the  domain 
theory.  We  show  that  the  overhead  imposed  by 
our  method  on  EBL  is  acceptable  (at  most 
polynomial).  The  approach  presented  in  this 
paper  was  implemented  in  a  system  called  LISE 
(Learning  in  Software  Engineering).  LISE  is  a 
system  which  translates  informal  and  non- 
operational  user  requirements  into  formal  and 
operational  software  specifications. 

Keywords:  Explanation-Based 
Learning,  Incomplete  Theory, 
Abduction,  Analytical  Cost  Evaluation 

1  Introduction 

This  paper  addresses  the  problem  of  learning  in  the  EBL 
framework  [Mitchell  et  al.  1986],  [DeJong  et  al.  1986], 
[Ellman  1989]  with  an  incomplete  domain  theory.  The 
proposed  strategy  applies  three  procedures  to  an 
incomplete  explanation  in  order  to  plausibly  complete  it. 
The  procedures  applied  are:  abduction,  analogical 
reasoning and case-based reasoning. More specifically,
when dealing with an incomplete explanation, we first
try to apply abduction to the missing part of the
explanation. If this fails, analogical inference is applied to
complete the explanation. Should this also fail to make the
explanation complete, case-based reasoning is used.


This  work  was  done  at  the  Knowledge  Acquisition 
Laboratory,  University  of  Ottawa.  The  three  authors  are 
with  the  Ottawa  Machine  Learning  Group  (OMLG). 


The  three-step  approach  was  implemented  in  a  system 
called  LISE  (Learning  in  Software  Engineering).  LISE  is 
a  system  using  EBL  with  an  incomplete  domain  theory  to 
translate  informal  and  non-operational  user  requirements 
into  formal  and  operational  software  specifications. 
Examples  from  LISE  will  be  used  in  this  paper  to 
illustrate  the  three-step  approach. 

LISE  was  successfully  applied  to  the  design  of  a 
specification  for  a  banking  system  and  a  fleet 
management  system. 

2  The Three-step Approach To Deal
With Incomplete Theories

An  explanation  in  EBL  is  built  using  rules.  The 
antecedents  of  the  rules  are  satisfied  using  facts  from  the 
training  example  or  using  the  consequents  of  other  rules. 
When the domain theory is incomplete, rules that would
be required to complete a particular explanation might be
missing. If this is the case, EBL will produce one or more
partial explanations. A partial explanation is an
explanation containing both proven and unproven antecedents.
The  unproven  antecedents  are  antecedents  for  which  no 
facts  were  found  in  the  training  example,  and  which 
cannot  be  proven  in  the  existing  domain  theory. 

An  incomplete  domain  theory  is  recognized  when  one 
or  many  partial  explanations  are  produced  instead  of  a 
complete  explanation.  Partial  explanations  are  built  using 
rules  which  have  in  their  antecedents  some  predicates  in 
common  with  the  training  example  facts.  These  rules  are 
used  in  our  approach  to  build  a  plausible  explanation  of 
the  training  example. 

There  are  three  steps  in  our  approach  to  deal  with  the 
incomplete  domain  theory.  The  first  step  starts  by 
selecting  the  partial  explanation  providing  the  best 
coverage of the training example. Next, abduction is used
to complete the partial explanation so that a plausible
explanation is produced without having to extend the
domain theory.

The  second  step  in  our  approach  is  applied  when  the 
first  step  does  not  work.  It  starts  by  selecting  the  partial 
explanation  providing  the  best  coverage  of  the  training 
example.  Next,  a  plausible  explanation  is  created  using 
analogical  reasoning  applied  between  the  unproven 


Explanation-Based Learning with Incomplete Theories: A Three-step Approach 287


antecedents  of  the  partial  explanations  and  the  training 
example  facts.  Multiple  partial  explanations  can  also  be 
combined  in  the  creation  of  the  plausible  explanation. 
Finally, new rules are extracted from the plausible
explanation and added to the domain theory.

The third step in dealing with an incomplete domain
theory involves the use of a case-based system. It will
be applied only if the two previous steps did not work. In
this step, the case-based system will retrieve a case and
adapt it to the training example. The domain theory will
not  be  extended  in  this  step. 
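
The order in which the three steps are tried can be summarized by a small sketch. It is an illustration under assumptions, not the LISE implementation: the function complete_explanation, the step names used as dictionary keys, and the stub step functions are all hypothetical, and only the control flow (try each step in turn, stop at the first success, and note whether the domain theory is extended) is taken from the text.

```python
# Whether a step adds rules to the domain theory, as stated in the text.
EXTENDS_THEORY = {"abduction": False, "analogy": True, "case-based": False}


def complete_explanation(example, steps):
    """Apply the steps in order and stop at the first one that succeeds.

    `steps` is a list of (name, function) pairs; each function returns a
    plausible explanation, or None when it cannot complete one.
    """
    for name, step in steps:
        explanation = step(example)
        if explanation is not None:
            return {
                "step": name,
                "explanation": explanation,
                "extends_theory": EXTENDS_THEORY[name],
            }
    return None  # no step could complete the explanation


# Hypothetical stubs: abduction fails here, analogy succeeds.
steps = [
    ("abduction", lambda ex: None),
    ("analogy", lambda ex: ["plausible explanation of " + ex]),
    ("case-based", lambda ex: ["adapted case for " + ex]),
]
result = complete_explanation("borrow(bob, 1000)", steps)
```

With these stubs the case-based step is never called, because analogy already returned an explanation.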

2.1  Selecting  The  Best  Partial  Explanation 

The abductive and analogical reasoning steps require that
the best partial explanation be selected. We define the best
partial explanation to be the partial explanation which
provides the best coverage of the training example
according to a heuristic we developed.

The  analogical  reasoning  step  may  also  require  that 
one  or  more  additional  partial  explanations  be  selected 
when  a  combination  of  partial  explanations  is  required  to 
build  the  plausible  explanation.  We  addressed  that 
requirement  by  ranking  partial  explanations  generated  for 
a  specific  training  example  according  to  a  score.  The  score 
is  attributed  to  each  partial  explanation  based  on  its 
coverage  of  the  training  example.  The  heuristic  developed 
to calculate the score is as follows:

a.  reward a partial explanation for each common feature
it shares with the training example,

b.  reward concise partial explanations, where
conciseness is measured in terms of inferences
required to prove a goal (the fewer inferences in the
explanation, the more concise it is),

c.  penalize a partial explanation for each of its
unproven leaves,

d.  penalize  a  partial  explanation  for  each  feature  of  the 
training  example  that  was  left  unaccounted  for,  and, 

e.  penalize  slightly  a  partial  explanation  for  each 
abductive  inference  that  was  used  in  its  construction. 
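
A scoring function with the shape of criteria (a) to (e) might look as follows. The text does not give weights or an exact formula, so everything numeric here is an assumption: the weights, the way conciseness is rewarded, and the feature counts of the two sample partial explanations are purely illustrative, as are the names PartialExplanation, score and rank.

```python
from dataclasses import dataclass


@dataclass
class PartialExplanation:
    name: str
    shared_features: int       # (a) features shared with the training example
    inferences: int            # (b) inferences needed to prove the goal
    unproven_leaves: int       # (c) unproven leaves
    unaccounted_features: int  # (d) training example features left unexplained
    abductive_inferences: int  # (e) abductive inferences used


def score(pe, w_shared=2.0, w_concise=1.0, w_unproven=1.0,
          w_unaccounted=1.0, w_abductive=0.25):
    """Hypothetical linear score: rewards (a) and (b), penalizes (c), (d), (e)."""
    return (w_shared * pe.shared_features
            + w_concise / (1 + pe.inferences)   # fewer inferences, larger reward
            - w_unproven * pe.unproven_leaves
            - w_unaccounted * pe.unaccounted_features
            - w_abductive * pe.abductive_inferences)  # slight penalty only


def rank(partial_explanations):
    """Best-scoring partial explanation first."""
    return sorted(partial_explanations, key=score, reverse=True)


# Illustrative counts only, not taken from the paper.
candidates = [
    PartialExplanation("deposit", 1, 2, 2, 5, 0),
    PartialExplanation("withdraw", 2, 2, 2, 4, 0),
]
best = rank(candidates)[0]
```

The ranking, rather than the absolute score, is what matters: the analogical step may need the second and third best partial explanations as well as the first.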

Our  heuristic  is  based  mainly  on  the  syntactic  nature 
of the partial explanations. We are currently investigating
the validity of the heuristic from a cognitive science
viewpoint. We anticipate that a semantic measure will be
appropriate, so that the ranking of partial explanations
also  takes  into  account  the  goals  of  the  users.  Such  a 
semantic  measure  will  require  the  use  of  background 
knowledge  in  the  ranking  of  the  partial  explanations. 

2.2  Abduction  To  Complete  Partial 
Explanations 

Abduction is the generation of hypotheses which, if
true, would explain observed facts [Pople 1973]. More
precisely, if the rule q <- p and the fact q are given, then
the desired abductive conclusion is p. p can be
characterized as a hypothesis because there could
exist another rule q <- p' which would have been used
to derive q.

Abduction will be used to complete the best partial
explanation selected by our heuristic. Considering the
training example features as facts, we will attempt to
draw hypotheses from the unproven antecedents using
abduction. If hypotheses can be drawn for each unproven
antecedent, the partial explanation will be completed.
Using the above example of the rule q <- p and the fact q, p
would be an unproven antecedent in the partial
explanation and q would be a fact given as a training
example feature. If p is the only unproven antecedent, a
hypothesis can be drawn to account for the occurrence of
q in the training example, and the partial explanation can
be  completed. 
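
The abductive inference described above can be sketched in a few lines. This is a simplified illustration, not the LISE implementation: terms are represented as tuples, strings starting with an upper-case letter are treated as variables, and the two rules are written after the necessary_condition of the client frame in the banking domain theory. The helper names is_var, match and abduce are assumptions.

```python
def is_var(term):
    """Variables are strings starting with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()


def match(pattern, fact):
    """One-way match of a rule head against a ground fact.

    Returns a dictionary of variable bindings, or None if they do not match.
    """
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    bindings = {}
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings


def abduce(rules, fact):
    """Given rules (q, p) read as q <- p and an observed fact q, return each p."""
    hypotheses = []
    for head, body in rules:
        bindings = match(head, fact)
        if bindings is not None:
            args = [bindings.get(a, a) for a in body[1:]]
            hypotheses.append((body[0], *args))
    return hypotheses


rules = [
    (("client", "Person"), ("account", "Person", "Account")),
    (("client", "Person"), ("safety_box", "Person", "Safety_box")),
]
hypotheses = abduce(rules, ("client", "bob"))
```

Two hypotheses are returned for client(bob), which is exactly why p is only a hypothesis: another rule with the same consequent could account for the observed fact.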

Abduction is not used to extend the domain theory.
The unproven antecedents of the rules used to provide the
partial explanations are changed into proven antecedents
based on the abductive inference. The correctness of the
domain theory cannot be jeopardized, since no new rule
is created.

2.3  Analogical  Reasoning  To  Complete 
Partial  Explanations 

The  analogical  reasoning  paradigm  in  our  work  consists 
of  deductive  inferences  made  using  goals  that  are  common 
to  features.  Two  features  having  a  common  goal  are  said 
to  be  analogous.  Goals  are  kept  in  the  domain  theory. 

When  using  analogical  reasoning,  a  plausible 
explanation  is  built  by  first  re-using  the  proven 
antecedents  of  the  best  partial  explanation  as  determined 
by  the  heuristic.  Unproven  antecedents  of  the  partial 
explanation  are  replaced  by  analogous  training  example 
facts.  New  rules  are  extracted  from  the  plausible 
explanation  and  will  be  added  to  the  domain  theory. 
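
The replacement of unproven antecedents by analogous facts can be sketched as a lookup on shared goals. The sketch below uses the banking example presented later in the paper; the function name replace_unproven and the tuple representation are assumptions, and the goal attached to record_loan is also an assumption (the paper states only that it replaces sub_fm_balance because the two share a goal).

```python
# Goals attached to properties and actions. The entry for record_loan is an
# assumption made for this illustration.
GOALS = {
    "account": "identify_client",
    "balance": "protect_bank_interest",
    "credit_margin": "protect_bank_interest",
    "sub_fm_balance": "record_transaction",
    "record_loan": "record_transaction",
    "issue_money": "inc_client_liquid",
}


def replace_unproven(unproven, facts, goals):
    """Map each unproven antecedent to the facts that share its goal."""
    replacement = {}
    for antecedent in unproven:
        goal = goals.get(antecedent[0])
        replacement[antecedent] = [
            fact for fact in facts
            if goal is not None
            and fact[0] != antecedent[0]
            and goals.get(fact[0]) == goal
        ]
    return replacement


unproven = [("balance", "Account", "Balance"),
            ("sub_fm_balance", "Account", "Amount")]
facts = [
    ("account", "bob", "acc_1"),
    ("credit_margin", "bob", 3500),
    ("record_loan", "bob", 1000),
    ("issue_money", "bob", 1000),
    ("car", "bob", "car_of_bob"),
    ("value", "car_of_bob", 12000),
]
analogies = replace_unproven(unproven, facts, GOALS)
```

Facts with no recorded goal, such as car and value, are never selected, which matches their treatment as irrelevant features.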

On certain occasions, more than one partial explanation
might be required to build the plausible explanation. This
is because a single partial explanation might explain
some features of a training example while leaving out
other features covered by another partial explanation. Our
approach provides a means of combining these multiple
partial explanations into a single plausible explanation.
Roughly, the proven antecedents of the partial
explanations are re-used and the unproven antecedents are
replaced by analogous training example facts. When
partial explanations are combined, it sometimes happens
that several unproven antecedents have no analogous
features in the training example. An analysis of the goals
of these unproven antecedents is performed to see if they
can be substituted or removed.

2.4  Case-based Reasoning Applied To
Training Examples

The  domain  theory  contains  rules  which  normally  enable 
EBL  to  explain  most  positive  training  examples. 
However,  some  instances  of  concepts  are  not  readily 
explainable  using  these  rules.  These  instances  are 
exceptions  to  general  rules.  These  instances  would 
normally  be  covered  by  a  small  disjunct  according  to 
[Holte  et  al.  1989].  They  also  correspond  to  the  marginals 
of  [Matwin  et  al.  1990].  The  third  step  of  our  approach 
employs  a  case-based  reasoning  system  to  retrieve 
previous  cases  to  apply  to  training  examples  which 
represent exceptions. A previous case is an extension to a
concept definition provided by the incomplete theory.

The  case-based  system  will  retrieve  a  case  for  a 
training  example  if  there  is  a  match  between  the  features 
of  the  case  and  the  features  of  the  training  example,  and  if 
the  order  of  the  features  of  the  case  is  preserved  in  the 
training example. There is a match between a case feature
and a training example feature when the name of the case
feature is the same as the name of a training example
feature  and  the  variables  in  the  case  unify  with  the 
constants  in  the  training  example.  The  unification  is 
propagated  to  other  features  of  the  case.  There  will  also  be 
a  match  when  the  constants  in  case  features  are  associated 
with  the  same  constants  in  the  training  example.  Finally, 
there  is  a  match  when  a  feature  in  the  case  can  be 
associated  with  an  analogous  feature  in  the  training 
example  as  long  as  all  the  previous  conditions  hold. 
When  more  than  one  case  is  retrieved,  a  heuristic  similar 
to  the  one  used  to  rank  the  partial  explanations  will  select 
the  most  applicable  one. 
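
The retrieval conditions can be sketched as an ordered match with propagated variable bindings. This is one possible reading, stated as an assumption: "order preserved" is taken to mean that the case features must appear in the training example in the same relative order (other features may be interleaved). Matching through analogous features and the ranking of several retrieved cases are omitted, and the names is_var, unify_feature and retrieve are hypothetical.

```python
def is_var(term):
    """Variables are strings starting with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()


def unify_feature(case_feature, example_feature, bindings):
    """Extend bindings so the case feature equals the example feature, else None."""
    if (case_feature[0] != example_feature[0]
            or len(case_feature) != len(example_feature)):
        return None
    new = dict(bindings)
    for c, e in zip(case_feature[1:], example_feature[1:]):
        if is_var(c):
            # A variable keeps the constant it was bound to in earlier features.
            if new.setdefault(c, e) != e:
                return None
        elif c != e:
            # Constants in the case must match the same constants.
            return None
    return new


def retrieve(case, example, bindings=None, start=0):
    """Match the case features, in order, against the example features."""
    bindings = {} if bindings is None else bindings
    if not case:
        return bindings
    for i in range(start, len(example)):
        extended = unify_feature(case[0], example[i], bindings)
        if extended is not None:
            result = retrieve(case[1:], example, extended, i + 1)
            if result is not None:
                return result
    return None


case = [("account", "P", "A"), ("issue_money", "P", "Amt")]
example = [("account", "bob", "acc_1"),
           ("balance", "acc_1", 150),
           ("issue_money", "bob", 100)]
found = retrieve(case, example)
```

Because the binding of P made on the account feature is propagated, issue_money for a different person would cause the retrieval to fail.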

The  cost  of  matching,  the  likelihood  that  a  case  be 
applicable,  and  the  size  of  the  case-base  make  the  case- 
based  approach  secondary  to  the  rule-based  approach  of 
EBL  in  our  methodology  to  deal  with  the  incomplete 
domain  theory.  An  example  of  case-based  reasoning  to 
deal  with  the  incomplete  domain  theory  is  in  [Genest  and 
Matwin  1990]. 

3  An  Example:  The  System  LISE 

LISE  (Learning  In  Software  Engineering)  is  a  system 
inspired  by  the  learning  process  experienced  by  an  analyst 
during the analysis phase. LISE uses EBL (Explanation-
Based Learning) to explain positive training examples
corresponding to user requirements using a domain theory
corresponding to a specification. When a training
example (an instance of a user requirement) can be
explained, the result is the corresponding specification
expressed using primitive operations. When a training
example cannot be explained, LISE will apply our three-
step approach to construct a plausible explanation using
abduction, analogical reasoning or case-based reasoning.

3.1  The  Domain  Theory  In  LISE 


In  LISE,  the  domain  theory  represents  the  specification  of 
an  application.  In  the  particular  application  considered 
here,  the  domain  theory  will  be  the  specification  of  a 
banking  system. 

The  specification  of  the  system  is  given  in  terms  of 
objects  and  operations  applicable  to  objects.  The  objects 
represent  structural  properties  of  the  system.  The 
operations  represent  behavioural  properties  of  the  system. 
The format of our domain theory is inspired by the
Extended Semantic Hierarchy Model (SHM+) described in
[Brodie et al. 1984].

The  domain  theory  consists  of  frames  arranged  in  a 
hierarchy  and  allowing  multiple  inheritance  of  properties. 
Each  frame  specifies  an  object  or  an  operation  using  a  set 
of  properties.  One  common  property  of  all  frames  is  isa 
which  links  a  frame  to  its  parents. 
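
A frame hierarchy with multiple inheritance through isa can be sketched with a dictionary. The person and client frames follow the banking domain theory of the paper; the employee and staff_client frames are hypothetical and are added only to show a frame with two parents. The dictionary layout and the function inherited_properties are assumptions, not the representation used in LISE.

```python
FRAMES = {
    "ENTITY": {"isa": [], "properties": []},
    "person": {"isa": ["ENTITY"],
               "properties": ["name", "address", "phone_number"]},
    "client": {"isa": ["person"],
               "properties": ["account", "credit_margin", "safety_box"]},
    # Hypothetical frames, used only to illustrate multiple inheritance.
    "employee": {"isa": ["person"], "properties": ["salary"]},
    "staff_client": {"isa": ["client", "employee"], "properties": []},
}


def inherited_properties(frame, frames):
    """Own properties first, then those of every ancestor reached via isa."""
    ordered, seen, visited = [], set(), set()
    stack = [frame]
    while stack:
        name = stack.pop()
        if name in visited:
            continue  # a parent shared by several paths is visited once
        visited.add(name)
        for prop in frames[name]["properties"]:
            if prop not in seen:
                seen.add(prop)
                ordered.append(prop)
        # Push parents in reverse so the first listed parent is explored first.
        stack.extend(reversed(frames[name]["isa"]))
    return ordered


client_props = inherited_properties("client", FRAMES)
staff_props = inherited_properties("staff_client", FRAMES)
```

The shared ancestor person is reached through both client and employee, but its properties are collected only once.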

Frames representing objects are located under the high-
level frame called ENTITY. Similarly, the frames
representing operations are located under the frames
ACTION and TRANSACTION. Each frame describing an
operation is defined by two special properties:
precondition and procedure.

The precondition property specifies the condition under
which the operation can be applied. The procedure
property specifies a fixed sequence of operations.
Operations specified under ACTION in the hierarchy are
non-primitive operations. Non-primitive operations are
defined by a precondition and a procedure which includes a
single primitive operation. Primitive operations are
atomic and they are not defined anywhere. Operations
specified under TRANSACTION in the hierarchy are defined
by a precondition and a procedure containing primitive
and non-primitive operations arranged in a fixed sequence.
Figure 1 illustrates six frames taken from a domain
theory containing the specification of a banking system.
Each frame of the domain theory is transformed into a
rule prior to executing EBL.

The frames defining entities enumerate properties of
the entity (e.g. person has a name). Some properties of
entities and some actions possess goals. These goals are
included in the domain theory of LISE. Goals are
essential to the analogical reasoning process employed to
extend the domain theory. An instance of a goal is:

goal(credit_margin(Person, Margin),
     protect_bank_interest)

which is read: the goal of the credit_margin property
of a person is to protect the bank interest.

3.2  An  Example  Of  Abduction 

The next scenario illustrates the use of abduction,
introduced in sec. 2.2, to complete a partial explanation.
The training example is the requirement for the
transaction withdraw(bob, 100) where the fact
account(bob, acc_1) was replaced by client(bob)
(Figure 2).

person
  isa: ENTITY
  name
  address
  phone_number

client
  isa: person
  account
    goal(account, identify_client)
  credit_margin
    goal(credit_margin, protect_bank_interest)
  safety_box
  necessary_condition:
    (client(Person) <-
       account(Person, Account))
    or
    (client(Person) <-
       safety_box(Person, Safety_box))

withdraw(Person, Amount)
  isa: TRANSACTION
  precondition:
    account(Person, Account)
      goal(account, identify_client)
  procedure:
    debit(Account, Amount)
    issue_money(Person, Amount)
      goal(issue_money, inc_client_liquid)

deposit(Person, Amount)
  isa: TRANSACTION
  precondition:
    account(Person, Account)
      goal(account, identify_client)
  procedure:
    receive_money(Person, Amount)
      goal(receive_money, dec_client_liquid)
    credit(Account, Amount)

debit(Account, Amount)
  isa: ACTION
  precondition:
    balance(Account, Balance)
      goal(balance, protect_bank_interest)
    Balance > Amount
  procedure:
    sub_fm_balance(Account, Amount)
      goal(sub_fm_balance, record_transaction)

credit(Account, Amount)
  isa: ACTION
  precondition: nil
  procedure:
    add_to_balance(Account, Amount)
      goal(add_to_balance, record_transaction)

Figure 1. Frames of the domain theory for banking


The partial explanation produced for the training
example contains the unproven antecedent
account(bob, Account), as shown in the explanation tree
of Figure 3. The partial explanation generated can be
changed into a plausible explanation by using abduction
to turn account(bob, Account) into a hypothesis
for the training example feature client(bob).

The training example is:
withdraw(bob, 100)

The facts are:
client(bob)
address(bob, 101_Colonel_by)
balance(acc_1, 150)
sub_fm_balance(acc_1, 100)
issue_money(bob, 100)

Figure 2. The training example for the abduction scenario


withdraw (bob, 100) 


Figure  3.  An  example  of  abduction 

As usual in EBL, irrelevant features of the training
example, e.g. address(bob, 101_Colonel_by), are left
out of the explanation.

3.3  Examples  Of  Analogical  Reasoning 

Analogical  reasoning  is  used  to  build  a  plausible 
explanation  re-using  parts  of  one  or  several  partial 
explanations  and  replacing  the  unproven  antecedents  by 
training  example  features  sharing  the  same  goal.  The 
parts  re-used  correspond  to  the  antecedents  of  partial 
explanations  found  as  facts  in  the  training  example. 

Two examples of analogical reasoning will be
presented. The first example shows how a plausible
explanation can be obtained using a partial explanation
and replacing its unproven antecedents by analogous
training example facts. The second example will show
how multiple partial explanations can be combined to
construct a plausible explanation.

3.3.1  Example 1: The Transaction borrow.

This process will be illustrated using the example of
the transaction borrow(bob, 1000). Figure 4 shows the
training example.

The domain theory is incomplete because the
specification of the transaction borrow required to explain
the training example is missing. Consequently, a set of
partial explanations will be produced. Figures 5a and 5b
illustrate the partial explanations that were obtained using
the transactions withdraw(Person, Amount) and
deposit(Person, Amount).

The training example is: borrow(bob, 1000).

The facts are:
account(bob, acc_1)
credit_margin(bob, 3500)
record_loan(bob, 1000)
issue_money(bob, 1000)
car(bob, car_of_bob)
value(car_of_bob, 12000)

Figure 4. The training example for borrow(Person, Amount)

[Explanation tree rooted at withdraw(bob, 1000)]

Figure 5a. Partial explanation for borrow produced using withdraw

[Explanation tree rooted at deposit(bob, 1000), with nodes credit(acc_1, 1000), add_to_balance(acc_1, 1000) (goal: record_transaction) and receive_money(bob, 1000)]

Figure 5b. Partial explanation for borrow produced using deposit

The  dashed  lines  in  the  partial  explanations  indicate 
which  antecedents  were  left  unproven.  The  underlined 
antecedents  in  the  partial  explanations  indicate  which 
antecedents  were  found  as  training  example  features.  The 
heuristic was used to score each partial explanation. The
partial explanation built using
withdraw(Person, Amount) scored the highest since it
contains more underlined antecedents than the partial
explanation built using deposit(Person, Amount).

Figure 6 shows that the proven antecedents of
withdraw are re-used in the plausible explanation of
borrow. The unproven antecedents were replaced by
selected training example features. A training example
feature is selected to replace an unproven antecedent if it
shares the same goal. In the borrow example,
account(bob, acc_1), issue_money(bob, 1000) and the
operator ">" were all re-used. The antecedent
balance(Account, Balance) was replaced by the
training example feature credit_margin(bob, 3500)
because they have the same goal: protecting the bank
interest. Similarly, but for a different goal,
sub_fm_balance(Account, Amount) was replaced by
record_loan(bob, 1000). LISE asked the user for a
name to replace debit(Account, Amount) in the
plausible explanation. The user provided the name
grant_loan.

Training example features that were neither re-used
nor selected are deemed irrelevant. In the example,
car(bob, car_of_bob) and value(car_of_bob, 12000)
are irrelevant.

Initially,  a  set  of  partial  explanations  was  obtained 
because  the  domain  theory  did  not  contain  the  frames  for 
borrow  and  for  granL_ioan.  To  circumvent  the 
incompleteness  problem,  a  plausible  explanation  was 
built  from  partial  explanations.  The  next  step  is  to 
synthesize  the  missing  frames  from  the  plausible 
explanation and to insert them in the domain theory. For

[Figure 6 (diagram not reproduced). The diagram includes the
test 3500 > 1000. Legend: feature of partial explanation
re-used; selected training example features replace an
unproven antecedent since they share the same goal; name
generated by the user.]

Figure 6. The plausible explanation for borrow produced using withdraw


that  task,  the  structure  of  the  frames  of  the  partial 
explanation  is  used  as  a  guide  to  create  the  new  frames 
from  the  plausible  explanation.  The  structure  of  the  frame 
of withdraw is used to create the frame of borrow and the
structure of the frame of debit is used to build the frame
of grant_loan. The frames are shown in figure 7.

Name:  grant_loan (Person, Amount) 
isa:  ACTION 
precondition: 

credit_margin (Person, Margin) 

Margin  >  Amount 
procedure: 

record_loan (Person,  Amount) 

Name:  borrow (Person, Amount) 
isa:  TRANSACTION 
precondition: 

account (Person, Account) 
procedure: 

grant_loan (Person,  Amount) 
issue_money (Person,  Amount) 

Figure 7. The frames for grant_loan and for borrow
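The frame synthesis step described above can be sketched as follows. This is only an illustration in Python, not the authors' Prolog implementation: the names Frame and synthesize_frame are hypothetical, and the content of the withdraw frame is an assumption reconstructed from the text (account and issue_money re-used, debit replaced by grant_loan).

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A frame as printed in figure 7: name, isa, preconditions, procedure."""
    name: str
    isa: str
    precondition: list = field(default_factory=list)
    procedure: list = field(default_factory=list)


def synthesize_frame(template, new_name, replacements):
    """Copy the structure of `template`, swapping every antecedent
    for which the plausible explanation supplies a replacement."""
    def swap(items):
        return [replacements.get(item, item) for item in items]
    return Frame(new_name, template.isa,
                 swap(template.precondition), swap(template.procedure))


# Assumed content of the existing withdraw frame (not shown in the paper).
withdraw = Frame(
    name="withdraw(Person, Amount)",
    isa="TRANSACTION",
    precondition=["account(Person, Account)"],
    procedure=["debit(Account, Amount)", "issue_money(Person, Amount)"],
)

# The user-supplied name grant_loan replaces debit; the rest is re-used.
borrow = synthesize_frame(
    withdraw,
    "borrow(Person, Amount)",
    {"debit(Account, Amount)": "grant_loan(Person, Amount)"},
)
```

Under these assumptions the synthesized borrow frame has the same slots as the one in figure 7: the precondition account(Person, Account) and the procedure grant_loan followed by issue_money.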

3.3.2  Example  2:  The  Transaction  transfer 

A  single  partial  explanation  might  contribute  to 
explain  several  features  of  the  training  example  while 
leaving  out  other  relevant  ones.  Such  a  situation  can  be 
suspected  when  several  partial  explanations  match 
different features of the training example. Figure 8
shows the training example for
transfer(Person, Amount), which will be learned using
two partial explanations.

The  domain  theory  is  incomplete  since  it  does  not 
include  the  frame  of  transfer.  Consequently,  partial 
explanations  are  produced  (figure  9a  and  9b).  It  is 
interesting  to  note  that  the  partial  explanation  obtained 
using  withdraw  covers  some  training  example  features 
(the  underlined  antecedents)  while  the  one  obtained  using 
deposit  covers  other  features.  In  that  case,  we  recognize 
that  a  single  partial  explanation  will  not  provide  enough 
foundation  to  build  the  plausible  explanation.  Both  partial 
explanations  will  be  integrated  to  produce  the  plausible 
explanation  of  transfer  (figure  10). 

The  training  example  is:  transfer (bob, 100) . 

The  facts  are: 

account(bob, acc_1)
account(bob, acc_2)
phone(bob, 992-2318)
balance(acc_1, 350)
sub_fm_balance(acc_1, 100)
add_to_balance(acc_2, 100)

Figure 8. The training example for
the transaction transfer(Person, Amount).

The  unproven  antecedents  issue_money  of  withdraw, 
and  receive_money  of  deposit,  do  not  raise  any 
problem  since  the  goal  hierarchy  indicates  that  they  can 
be  mutually  removed  when  the  person  and  the  amount 
are the same.

[Diagram not reproduced; root: withdraw(bob, 100)]

Figure 9a. Partial explanation for transfer produced
using withdraw

[Diagram not reproduced; root: deposit(bob, 100)]

Figure 9b. Partial explanation for transfer produced
using deposit

After the plausible explanation was built, the frame
(figure 11) for transfer is synthesized and integrated
with the domain theory. It is the only frame added to the
domain theory since the frames of debit and credit
already exist.

4  Conclusion 

This  paper  has  presented  a  learning  method  in  which  EBL 
is  used  in  concert  with  an  incomplete  domain  theory. 
The incomplete theory is handled by
integrating abduction, analogical reasoning and case-based
reasoning. Inasmuch as our system augments the
deductive  closure  of  its  domain  theory  by  adding  rules 
into  it,  it  achieves  knowledge-level  learning.  This  is 
seldom  the  case  for  EBL  systems. 


[Ellman  1989]  mentions  that  the  effort  in  EBL 
research  addresses  the  problems  of  justified  generalization, 
chunking,  operationalization  and  justified  analogy.  With 
regard  to  that  classification,  our  work  addresses  the 
problem  of  operationalization  since  our  objective  is  to 
translate  a  non-operational  expression  (i.e.  user 
requirement)  into  an  operational  one  (i.e.  a  specification). 

[Ellman  1989]  divides  the  methods  to  handle  the 
incomplete  domain  theories  into  the  analytical  methods 
and the empirical methods. The use of abduction,
analogical reasoning and case-based reasoning categorizes
LISE as an analytical method. Other analytical methods
require  that  a  pair  of  training  examples  with  similar 
functions  be  presented  simultaneously  [Hall  1988]  or  they 
need  an  experimentation  theory  to  refine  the  domain 
theory  [Rajamoney  1988]. 

Empirical methods deal with incomplete domain
theories (e.g. [Pazzani 1988], [Fawcett 1989]) by
conjecturing rules to fill holes in the partial explanation
and by using subsequent training examples to refine
the conjectured rules empirically.

Building  partial  explanations  in  EBL  can  be  very 
expensive  since  the  entire  domain  theory  must  be 
examined.  Our  system  was  provided  with  a  heuristic  to 
limit  the  search  for  partial  explanations.  The  heuristic  is 
to  require  that  the  order  of  features  in  the  training 
examples  be  strictly  equivalent  to  the  order  of  the 
antecedents  in  the  rules.  The  same  heuristic  is  also 
employed  in  our  case-based  reasoning  system. 

The following is an analysis of the cost incurred by
our approach. The search of EBL produces a list of P
partial explanations for a training example composed of N
facts. Assigning a score to each partial explanation and
ranking them by sorting costs O(P log P).
Considering that there are m unproven antecedents in the
best partial explanation and that there are r rules having
each unproven antecedent in their right-hand side, we
obtain a cost of r*m for the search related to abduction.
If the abduction fails, this cost is augmented by N*m,
which is needed to solve the training example using
analogical reasoning. Thus, the cost for the first two
steps of our approach is O(P log P + (r+N) * m). A
good domain theory will provide partial explanations with
small values for m.
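As a rough illustration of this cost analysis, the sketch below ranks partial explanations by sorting and evaluates the cost expression for given parameters. The function names, the dictionary layout and the sample data are hypothetical; the score is assumed to be the number of proven antecedents, which is how the borrow example above compares withdraw and deposit.

```python
import math


def rank_partial_explanations(explanations):
    """Sort partial explanations by number of proven antecedents,
    best first. Sorting P items costs O(P log P)."""
    return sorted(explanations, key=lambda e: len(e["proven"]), reverse=True)


def estimated_cost(P, r, N, m):
    """Evaluate P log P + (r + N) * m, the cost of ranking plus
    abduction (r * m) plus analogical reasoning (N * m)."""
    return P * math.log2(P) + (r + N) * m


# Illustrative data only: two partial explanations for borrow.
candidates = [
    {"rule": "deposit", "proven": ["account(bob, acc_1)"]},
    {"rule": "withdraw", "proven": ["account(bob, acc_1)",
                                    "issue_money(bob, 100)", ">"]},
]
best = rank_partial_explanations(candidates)[0]
```

With these sample values the withdraw explanation is ranked first, and a small m keeps the second term of the cost small, as the text notes.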

The cost will increase considerably if case-based
reasoning is applied to the training example. The cost of
matching cases in our approach will be polynomial since
the features are ordered in our cases and in the training
examples. The cost of assigning the score and ranking the
cases is O(C log C), where C is the number of cases
having at least one match. The cost of adapting cases is
S*N, where S is the number of unmatched case features
and N is the number of training example facts. The
ordering imposed on the facts of the training examples,

[Figure 10 (diagram not reproduced). Legend: feature of
partial explanation re-used; name generated by the user.]

Figure 10. The plausible explanation for transfer produced using withdraw (the goals are omitted for clarity)


Name: transfer(Person, Amount)
isa: TRANSACTION
precondition:

account(Person, Account1)
account(Person, Account2)
procedure:

debit(Account1, Amount)
credit(Account2, Amount)

Figure  11.  The  new  frame  for  transfer  added  to  the 
domain  theory. 

on the predicates of the rules, and on the features of the
cases results in a polynomial cost rather than the prohibitive
NP-complete problem of matching unordered sets of
features.

Since the costs of abduction and analogical reasoning are
of the same order, we chose to apply abduction first in
our approach because abduction does not change the
domain theory while analogical reasoning does. We
believe that the domain theory should be modified
conservatively, only when abduction cannot transform a
partial explanation into a plausible one.

LISE  was  successfully  implemented  in  Prolog.  A 
specification  for  a  small  banking  system  was  developed. 
We are currently working on the specification of a fleet
management system.

5  Future  Work 

A  future  goal  of  our  research  is  to  study  the  validity  of 
our  approach  from  a  cognitive  science  viewpoint.  We  are 
interested  in  the  way  people  select  partial  explanations 


when  faced  with  problems  and  how  people  apply 
analogical  reasoning  to  create  new  explanations  from 
previous  ones.  We  hope  to  obtain  an  answer  to  these 
questions  from  a  current  experiment  designed  jointly  with 
a  cognitive  scientist. 

As for any non-empirical learning method, learning is
limited by the amount of initial knowledge contained in
the domain theory and the case-base. A future goal for
this research is to make the system less dependent on the
initial knowledge.

Abstract

We  propose  a  solution  to  problem  solving  by 
analogy  which  is  an  alternative  to  Carbonell's 
transformational  analogy.  Given  a  plan  that 
succeeds  for  the  base,  we  apply  the  plan  to  the 
target and propose to correct its failures by an
abductive recovery mechanism inspired by
abductive recovery from failed proofs.

1  Introduction 

The analogy scheme we shall use in this paper is quite a
classical one (Winston, 1982; Gentner, 1983; Chouraqui,
1985; Falkenhainer, Forbus, and Gentner, 1986;
Carbonell, 1983, 1986; Kedar-Cabelli, 1988; Kodratoff,
1988). It can be described as follows. Let us suppose that
we have at our disposal a piece of information, the base,
that can be put into the form of a doublet (A, B) in which it is
known that B depends on A. This dependency will often
be causal, and it does not need to be very formal or
strict. In the following, we shall call this relation P, and
refer to it as the causality¹ of the analogy. Suppose
now that we find another piece of information, the
target, (A', B') that can be put into the same form, and
such that there exists some resemblance (similarity) between
A and A'. In the following, we shall call this relation
a, and refer to it as the similarity of the analogy.
Let us call P' the causal dependency between A' and B',
and a' the similarity between B and B', as shown in the
figure below.

[Figure 1 (diagram not reproduced): the base and the target
are linked by resemblance/difference relations (similarity);
within each, dependence relations (causality) link the two parts.]

Figure 1: The general scheme of analogy

1 It is also often called the internal dependency of the base.
Arguments for calling it causality are found in section 2 and
in Kodratoff (1990).


The very first idea one can try in applying this scheme to
problem solving is to consider that A is the initial state
of the system, B is its final state, and P is a sequence of
applications of operators leading from A to B. The analogy
problem then becomes finding a P' that leads
from a new initial state A', similar to A, to a new final
state B'. This view is summarized in figure 2a below.


[Figure 2 (diagrams not reproduced): the base has an initial
state and a final state; the target has an initial state, a
final state, and a looked-for solution.]

Figure 2a: A classical view of the use of analogy for
problem solving.

Figure 2b: Our scheme for using analogy in problem
solving. It requires a plan for a solution, instead of a
particular solution as in figure 2a.


In this paper, we would like to present a somewhat different
view of problem solving, in which A is the set of
means of the source problem, B the set of its goals,
and P is a plan for going from A to B. This view is
summarized in figure 2b, above. The means of the problem
are made of all the knowledge necessary to solve the
problem, and of the instantiations particular to the problem
at hand. The goals of the problem are the set of the
consequences, interesting for solving the problem at
hand, of applying plan P to A. In the following we shall
elaborate an example coming from (DeJong and Mooney,
1986). The means of the base are abducting a rich
person's child, and the goal is that the kidnapper becomes rich.
We are given a plan (the one of DeJong and Mooney,
1986) that solves the problem of achieving goal B by
means A. This plan gives the cause why it is possible to
achieve goal B by means A; it will therefore be seen here
as the causality P linking A and B. We are also given
another problem, i.e., goals B' and means A' (e.g., A' may
be: abducting a famous politician, and B' may be:
terrorists advertise their political cause). The analogical
problem is then to use the plan P in order to invent a new
plan P' that will achieve B' by A' (e.g., how can terrorists
get advertisement by abducting a famous politician?).

Our proposal for finding which transformations to apply
to plan P in order to obtain plan P' is precisely to attempt
applying plan P to A', analyze the partial successes and
failures of this application, and induce from them a P'
that will include the successes and eliminate the failures.
Thus, recovering from a failed analogy is central in our
view of problem solving by analogy.

We shall present here a scheme which is very near to abductive
recovery of proof failures as presented by Cox and
Pietrzykowski (1986), Duval and Kodratoff (1989). A detailed
example of abductive recovery is presented in section 6.
The fundamental technique implementing such a
recovery system is the inversion of resolution as described
in (Muggleton and Buntine, 1988; Rouveirol and
Puget, 1989). Let us now see what these techniques are
precisely, and how they can be applied to recovery from
plan failures.

2  Causal  knowledge  in  creative 
analogies 


In the case of recognition and evaluation of existing
analogies, there is no need to draw a difference between
similarity and causality. In that case, causality is just one
more similarity between source and target. On the
contrary, causality is central to the generation of new
analogies.

As an illustration, consider the following analogy,
proposed in (Russell, 1989).

From nationality(Louis, France) & nationality(Antoinette,
France) & native_language(Louis, French),
Russell (1989) finds by analogy that
native_language(Antoinette, French). Adding new
information about Louis and Antoinette (we assume here
that these characters are the royal couple sent to the
guillotine during the French revolution, thus taking
Antoinette for Marie-Antoinette), like lives_in(Louis,
France) and lives_in(Antoinette, France), will increase the
similarity between Louis and Antoinette, therefore
increasing the relevance of the analogy. Conversely,
adding information like male(Louis) & born_in(Louis,
France) and female(Antoinette) & born_in(Antoinette,
Austria) will decrease its relevance. With this added
information, the analogy can be written without
causality, as follows.

Base: nationality(Louis, France) & lives_in(Louis, France)
      & born_in(Louis, France) & ...
      native_language(Louis, French)

Target: nationality(Antoinette, France) & lives_in(Antoinette, France)
      & born_in(Antoinette, Austria) & ...
      native_language(Antoinette, French)

Figure 3. The given analogy, without causality.

On the contrary, one can also consider that some of this
information is causal. This allows us to recover the
given analogy when one considers P1 =
lives_in(Louis, France) and P'1 = lives_in(Antoinette,
France) as causalities for the fact of being a native French
speaker, and when one does not take into account that
Antoinette was born in Austria.

Base: nationality(Louis, France)
      P1 = lives_in(Louis, France)
      native_language(Louis, French)

Target: nationality(Antoinette, France)
      P'1 = lives_in(Antoinette, France)
      native_language(Antoinette, French)

Figure 4. Inventing again the given analogy by using a
causality of the form "x lives_in y" in order to explain
that "native_language(x) = language(y)".

Consider now that one adds the following information
about Louis: P2 = born_in(Louis, France). Then, the similar
information about Antoinette, P'2 = born_in(Antoinette,
Austria), leads to the analogy native_language(Antoinette,
German), or to fluent_in(Antoinette,
French), depending on the causality to be used. If P2 and
P'2 are considered as causal and P1 and P'1 are considered
as factual, then the analogy should give
native_language(Antoinette, German).

In this analogy, one is implicitly using theorems of the
kind: ∀x [born_in(x, France) => native_language(x,
French)] and ∀x [born_in(x, Austria) =>
native_language(x, German)].

Base: nationality(Louis, France) & lives_in(Louis, France)
      P2 = born_in(Louis, France)
      native_language(Louis, French)

Target: nationality(Antoinette, France) & lives_in(Antoinette, France)
      P'2 = born_in(Antoinette, Austria)
      native_language(Antoinette, German)

Figure 5. Inventing another analogy by using a causality
of the form "x born_in y" in order to explain
"native_language(x) = language_of(y)".


The choice of using these theorems follows from the
choice of causality. Let us show why in three steps.

First step: Understanding causality. In the present case,
the causality is P2 = born_in(Louis, France), which
"explains" why native_language(Louis, French). From
this, we can infer that the analogy has been using ways
of deducing the result from its causality. Therefore, we
have to consider theorems that have a generalization of
born_in(Louis, France) in their premise, and that have a
generalization of native_language(Louis, French) in their
conclusion. In other words, we have to consider the
different ways by which one might prove something of
the form native_language(x, y) from something of the
form born_in(x, y). This may be very difficult, and
finding the link between the causality and its
consequences may become a huge task by itself. In the
very case we are looking at presently, this inference can
be done in a single step by using the theorem ∀x
[born_in(x, France) => native_language(x, French)].

Second step: Using similarity. Similarity tells us that
Louis in the base must be replaced by Antoinette in the
target. Therefore, we guess that the causality in the target
is P'2 = born_in(Antoinette, Austria).

Third step: Combining causality and similarity. We look
for theorems the premise of which is a generalization of
born_in(Antoinette, Austria), and the conclusion of
which is a generalization of native_language(x, y). Once
more, this step may be very complicated but, in this
case, we find in one step that ∀x [born_in(x, Austria) =>
native_language(x, German)] is the looked-for theorem.
Applying it to the premise born_in(Antoinette, Austria)
leads to the conclusion native_language(Antoinette,
German), which becomes the conclusion of our analogy,
as shown in figure 5.
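The three steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the authors' system: theorems are reduced to a hypothetical lookup table from a (premise predicate, value) pair to a (conclusion predicate, value) pair, and the names THEOREMS, facts and conclude_by_analogy are assumptions made for the example.

```python
# Hypothetical theorem table: (premise predicate, place) -> (conclusion
# predicate, language). It encodes the two born_in theorems quoted above.
THEOREMS = {
    ("born_in", "France"): ("native_language", "French"),
    ("born_in", "Austria"): ("native_language", "German"),
}

# Known facts about the base (Louis) and the target (Antoinette).
facts = {
    ("born_in", "Louis"): "France",
    ("born_in", "Antoinette"): "Austria",
    ("lives_in", "Louis"): "France",
    ("lives_in", "Antoinette"): "France",
}


def conclude_by_analogy(causal_predicate, target, facts):
    """Step 1: the chosen causality fixes which theorems are relevant.
    Step 2: similarity substitutes the target for the base individual.
    Step 3: apply the theorem whose premise covers the target's fact."""
    value = facts[(causal_predicate, target)]
    conclusion = THEOREMS.get((causal_predicate, value))
    if conclusion is None:
        return None  # no theorem links this causality to a conclusion
    predicate, result = conclusion
    return (predicate, target, result)


result = conclude_by_analogy("born_in", "Antoinette", facts)
```

Choosing born_in as the causality yields native_language(Antoinette, German), as in figure 5. Choosing lives_in returns nothing here only because the sample table contains no theorem with a lives_in premise.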

Consider now that P1 and P'1 are causal and that P2 and
P'2 are factual. Then, the analogy should give
fluent_in(Antoinette, French).


Base: nationality(Louis, France) & born_in(Louis, France)
      P1 = lives_in(Louis, France)
      native_language(Louis, French)

Target: nationality(Antoinette, France) & born_in(Antoinette, Austria)
      P'1 = lives_in(Antoinette, France)
      fluent_in(Antoinette, French)

Figure 6. Inventing another analogy by using a causality
of the form "x lives_in y". In the context of "born_in(x,
y)", this explains "native_language(x) = language_of(y)".
In the context of "NOT born_in(x, y)", this explains
"fluent_in(x, language_of(y))".


In this analogy, the theorems implicitly used are: ∀x
[lives_in(x, France) & born_in(x, France) =>
native_language(x, French)], ∀x [lives_in(x, France) &
¬born_in(x, France) => fluent_in(x, French)]. Applying
the above three steps in the same way would lead us to
choose these theorems (instead of the two above). In
other words, we can say that we have been using
theorems the left-hand side of which can be instantiated
by lives_in(Antoinette, France), such as ∀x [lives_in(x,
France) & ... => ...].


When creating analogies, the choice of a piece of information
as causality orients the invention process. On the
contrary, when analyzing existing analogies, all kinds of
information play a role in rating the given analogy, some
being more relevant than others. For instance, in
the case of the given analogy above, one might well use
both lives_in(Louis, France) and
born_in(Louis, France) to rate the given analogy. On the
contrary, when creating analogies, one has to choose
which of the available pieces of information is of causal
nature, and this choice changes the output of the analogy
process. In other words, from the analysis point of view,
native_language(Antoinette, German) is as good an
analogy as fluent_in(Antoinette, French), while from the
invention point of view, they differ in the information
that has been chosen as causal. In practice, one
always has at one's disposal large numbers of theorems such as
those exemplified in example 1, possibly even
theorems that contradict each other. Analogy, which
contains for us a choice of causality, allows one to choose
which to use.


3 Inversion of resolution

Let us start with an initial theory T. Suppose that T
meets an example E that it cannot explain, because E
cannot be derived from T. In this case, one solution is to
consider that T is incomplete. Induction then amounts to
finding a new theory T' that allows one to derive both the
initial theory T and the example E.

Inversion of resolution is based on three basic operators,
called absorption, intraconstruction and truncation
(Muggleton and Buntine, 1988; Rouveirol and Puget,
1989). Let us illustrate how inversion of resolution
works with the following theory defining family relationships.
Since inversion of resolution has so far been developed
in PROLOG, we shall describe these examples
in PROLOG notation.

T1: grandfather(X,Z) :- father(X,Y), father(Y,Z).

T2: father(X,Y) :- child_of(Y,X), sex(X,male).

T3: mother(X,Y) :- child_of(Y,X), sex(X,female).

Suppose now that the following example is met.

E: grandfather(tom,liz) :- father(tom,helen), child_of(liz,helen),
sex(helen,female).

It is clearly not entailed by the available theory. Let us
show how inversion of resolution allows one to induce from
T and E a new theory, T', that entails T and E.

3.1  Absorption 

Absorption of a clause Ti of the theory by the example
E is possible if the body of the clause Ti can be unified
with a part of the body of E. In the example, only one
clause of the theory can be absorbed by the example: T3,
because its body child_of(Y,X), sex(X,female) can be
unified with the part of E child_of(liz,helen), sex(helen,
female) with the substitution {Y/liz, X/helen}.
Absorption then replaces, in the body of the absorbent
clause E, the instantiated body of the absorbed clause
Ti by its head, properly substituted. This gives a new
form to the example.

E': grandfather(tom,liz) :- father(tom,helen), mother(helen,liz).
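The absorption step can be sketched as below. This is a simplified illustration under stated assumptions, not the operator of Muggleton and Buntine in full generality: the example clause is assumed to be ground, literals are tuples of a predicate name and atomic arguments, capitalized strings stand for variables, and the names is_var, match and absorb are hypothetical.

```python
def is_var(term):
    """Prolog convention: a capitalized name is a variable."""
    return term[:1].isupper()


def match(pattern, ground, subst):
    """One-way match of a literal with variables against a ground literal.
    Returns the extended substitution, or None if they do not match."""
    if pattern[0] != ground[0] or len(pattern) != len(ground):
        return None
    subst = dict(subst)
    for p, g in zip(pattern[1:], ground[1:]):
        if is_var(p):
            if subst.setdefault(p, g) != g:
                return None
        elif p != g:
            return None
    return subst


def absorb(clause, example):
    """If the whole body of `clause` matches part of the body of
    `example`, replace that part by the instantiated head of `clause`."""
    (head, body), (ex_head, ex_body) = clause, example

    def search(i, subst, used):
        if i == len(body):
            return subst, used
        for j, literal in enumerate(ex_body):
            if j in used:
                continue
            extended = match(body[i], literal, subst)
            if extended is not None:
                found = search(i + 1, extended, used | {j})
                if found is not None:
                    return found
        return None

    found = search(0, {}, frozenset())
    if found is None:
        return None
    subst, used = found
    new_literal = (head[0],) + tuple(subst.get(t, t) for t in head[1:])
    rest = [lit for j, lit in enumerate(ex_body) if j not in used]
    return (ex_head, rest + [new_literal])


T1 = (("grandfather", "X", "Z"), [("father", "X", "Y"), ("father", "Y", "Z")])
T2 = (("father", "X", "Y"), [("child_of", "Y", "X"), ("sex", "X", "male")])
T3 = (("mother", "X", "Y"), [("child_of", "Y", "X"), ("sex", "X", "female")])
E = (("grandfather", "tom", "liz"),
     [("father", "tom", "helen"), ("child_of", "liz", "helen"),
      ("sex", "helen", "female")])

E_prime = absorb(T3, E)
```

On this data only T3 is absorbed, with Y bound to liz and X bound to helen, and the result corresponds to E'. T1 and T2 are rejected because their bodies cannot be matched in full.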

3.2  Intraconstruction 

Intraconstruction can occur between two clauses C1 and C2 if
the heads of C1 and C2 match, and their bodies match
partially. It proceeds in three steps.

Firstly, it generates a new clause T1a_tempo,
the head of which is the least generalization of the heads
of C1 and C2. Its body is the least generalization of the
common parts of the bodies of C1 and C2. For instance,
E' and T1 can undergo the first step of intraconstruction,
giving the following clause.

T1a_tempo: grandfather(U,W) :- father(U,V).

The second step takes care of the left-over literals
in C1 and C2. Here, the literal mother(helen,liz) of
E' and father(Y,Z) of T1 have been left over during
the first step of intraconstruction. Intraconstruction then
introduces a new literal, arbitrarily called newp, that becomes
the head of these left-overs. The arguments of newp
have to be carefully chosen to keep the variable bindings
that were present in C1 and C2. Note that the two clauses
thus built define the predicate newp in extension, i.e., no
generalization has occurred. In our example, these
new clauses are

T1b: newp(Y,Z) :- father(Y,Z).

T1c_tempo: newp(helen,liz) :- mother(helen,liz).

The third step of intraconstruction generates a
new version of the clause of the theory which has
undergone intraconstruction (here, T1), by replacing the
left-over part of this clause by the new predicate newp,
taking care again to introduce the correct variable binding.
This generates the clause

T1a: grandfather(U,W) :- father(U,V), newp(V,W).

3.3  Truncation 

The truncation operator is a generalization operator which
must be controlled in some way. In our example, it replaces
constants by variables in order to give the same
degree of generality to the clauses generated by intraconstruction.
Applying truncation to T1c_tempo gives the
more general clause

T1c: newp(Y,Z) :- mother(Y,Z).
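A naive form of this generalization can be sketched as follows. The sketch assumes a ground clause and replaces every constant, which is exactly the uncontrolled behaviour the text warns about; the function name truncate and the generated variable names V1, V2 are hypothetical.

```python
def truncate(clause):
    """Replace each distinct constant of a ground clause by a variable,
    the same constant always mapping to the same variable."""
    head, body = clause
    variables = {}

    def generalize(literal):
        predicate, *args = literal
        new_args = []
        for constant in args:
            if constant not in variables:
                variables[constant] = f"V{len(variables) + 1}"
            new_args.append(variables[constant])
        return (predicate, *new_args)

    return generalize(head), [generalize(literal) for literal in body]


T1c_tempo = (("newp", "helen", "liz"), [("mother", "helen", "liz")])
T1c = truncate(T1c_tempo)
```

Up to variable renaming, the result is the clause T1c above. A controlled version would have to decide which constants to keep, for instance female in sex(X,female).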

4  Formation  of  a  new  theory 

The new theory T' is formed by deleting from T the original
clause that underwent intraconstruction, and by
adding to T the clauses generated by intraconstruction and
truncation. In our example, this gives the set of clauses
{T1a, T1b, T1c, T2, T3}. T' is able to recognize the new
example, and all other examples of maternal grandfathers.
It is also possible to consult an oracle, who might propose
to call newp by the name parent. This is useful for
the sake of knowledge base readability.
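The claim that T' recognizes the new example while T does not can be checked with a very small backward-chaining prover. The sketch below is an illustration only, written in Python rather than PROLOG; it assumes atomic arguments (no nested terms), treats the body of E as facts, and all function names are hypothetical.

```python
import itertools


def is_var(term):
    return isinstance(term, str) and term[:1].isupper()


def walk(term, subst):
    """Follow variable bindings until an unbound variable or a constant."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def unify(a, b, subst):
    """Unify two literals with atomic arguments; return a new substitution."""
    if a[0] != b[0] or len(a) != len(b):
        return None
    subst = dict(subst)
    for x, y in zip(a[1:], b[1:]):
        x, y = walk(x, subst), walk(y, subst)
        if x == y:
            continue
        if is_var(x):
            subst[x] = y
        elif is_var(y):
            subst[y] = x
        else:
            return None
    return subst


_fresh = itertools.count()


def rename(clause):
    """Give the clause fresh variable names before each use."""
    n = next(_fresh)

    def r(lit):
        return (lit[0],) + tuple(f"{t}_{n}" if is_var(t) else t for t in lit[1:])

    head, body = clause
    return r(head), [r(lit) for lit in body]


def prove(goals, theory, subst=None):
    """Depth-first search: yield every substitution proving all goals."""
    subst = {} if subst is None else subst
    if not goals:
        yield subst
        return
    first, rest = goals[0], goals[1:]
    for clause in theory:
        head, body = rename(clause)
        extended = unify(first, head, subst)
        if extended is not None:
            yield from prove(body + rest, theory, extended)


FACTS = [(("father", "tom", "helen"), []), (("child_of", "liz", "helen"), []),
         (("sex", "helen", "female"), [])]
T2 = (("father", "X", "Y"), [("child_of", "Y", "X"), ("sex", "X", "male")])
T3 = (("mother", "X", "Y"), [("child_of", "Y", "X"), ("sex", "X", "female")])
T1 = (("grandfather", "X", "Z"), [("father", "X", "Y"), ("father", "Y", "Z")])
T1a = (("grandfather", "U", "W"), [("father", "U", "V"), ("newp", "V", "W")])
T1b = (("newp", "Y", "Z"), [("father", "Y", "Z")])
T1c = (("newp", "Y", "Z"), [("mother", "Y", "Z")])

goal = [("grandfather", "tom", "liz")]
old_ok = next(prove(goal, [T1, T2, T3] + FACTS), None) is not None
new_ok = next(prove(goal, [T1a, T1b, T1c, T2, T3] + FACTS), None) is not None
```

With these clauses the goal grandfather(tom,liz) fails under the original theory and succeeds under T', through newp(helen,liz) and mother(helen,liz).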

We  consider  here  the  case  where  the  theory  has  already 
been  used  at  least  once  with  success.  We  shall  use  this 
positive  past  experience  in  order  to  drive  the  inversion  of 
resolution.  For  example,  in  problem  solving  by  analogy, 
the  base  case  is  such  a  success.  Suppose  now,  that  new 
problems  are  given  to  the  system,  and  that  it  is  unable  to 
solve them. Similarly, a theory may or may not be able to
"recognize" an example as belonging to the theory. If it
fails to do so, we can consider that the reason for the failure
is the incompleteness of our theory, and we will accordingly
attempt to make it complete. We propose here
to extend a theory by two different abduction mechanisms.

Suppose that we start with a theory Th0 and an example
E0 of a concept C, such that E0 can be proven from Th0.
This proof generates a complete proof tree Pc0. Using
now classical EBG techniques (Mitchell et al., 1986),
this proof tree can be generalized as far as the
proof remains sufficient to cover the example. Suppose also that
we now meet another example E1 of C, such that it is
not recognized by Th0. In that case, we will suppose here
that, nevertheless, a partial proof tree Pp1 can be generated
for E1. Our solution to abduction consists in proposing
abduction mechanisms that will complete Pp1,
thus obtaining a complete proof Pc1. Our rule to control
abduction is to try to obtain a Pc1 which is as "close" as
possible to Pc0. When a Pc1 has been obtained, inversion
of resolution allows one to complete the theory accordingly.

The following example, inspired by DeJong and
Mooney (1986), will be used as an illustration. In this example,
our aim is learning a definition of the concept of
suicide kill(x,x). The domain theory Th0 contains the
following rules.

Theory Th0

kill(A,B) :- hate(A,B), possess(A,C), shot-gun(C).

hate(W,W) :- depressed(W).

possess(U,V) :- buy(U,V).

where A, B, C, U, V, W, Z are variables.

The training instance E0 is a suicide, described by the following
facts.

E0

depressed(john).

buy(john,obj1).

shot-gun(obj1).

:- kill(john,john).

Let us call Pc0 the proof that E0 is a consequence of the
theory Th0. Its proof tree can be generalized as follows
(DeJong and Mooney, 1986).


kill(X,X)
  hate(X,X)
    depressed(X)
  possess(X,C)
    buy(X,C)
  shot-gun(C)

Figure 7. Generalized proof that John has
committed suicide.

Suppose now that the system is provided with an example
E1 of the concept of suicide which is not recognized by
the theory. Supposing that a partial proof Pp1 of E1 can
be obtained, we will try to complete Pp1 in
such a way that it becomes as "close" as possible to
Pc0.

As an illustration of such an E1, consider another suicide
instance E1 described by the following facts.

E1

depressed(mary).
buy(mary,obj1).
sleeping-pills(obj1).
price(obj1, 6).

where obj1 is a constant.

We  will  be  unable  to  prove  kill(mary,mary)  because  she 
has  no  shot-gun.  Nevertheless  we  obtain  one  partial 
proof,  and  only  one  in  this  case. 

kill(mary,mary)

hate(mary,mary)    possess(mary,obj1)    ?
     |                  |
depressed(mary)    buy(mary,obj1)

Figure 8. Pp1: partial explanation of Mary's suicide.
Pp1 obviously matches a sub-tree of Pc0.
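The partial proof Pp1 can be approximated by instantiating the leaves of the generalized proof Pc0 on Mary's facts and recording which leaves are proved and which are missing. This is a minimal sketch, assuming that the leaves of Pc0 are depressed(X), buy(X,C) and shot-gun(C), and that a brute-force search over the constants of E1 is acceptable; `partial_proof` is a hypothetical helper name.

```python
from itertools import product

# Leaves of the generalized proof Pc0 (capitalized arguments are variables).
LEAVES = [("depressed", "X"), ("buy", "X", "C"), ("shot-gun", "C")]

E1 = {
    ("depressed", "mary"),
    ("buy", "mary", "obj1"),
    ("sleeping-pills", "obj1"),
    ("price", "obj1", "6"),
}

def instantiate(atom, binding):
    return (atom[0],) + tuple(binding.get(t, t) for t in atom[1:])

def partial_proof(leaves, facts, fixed):
    """Return (proved, missing) for the binding that proves the most leaves."""
    constants = sorted({t for fact in facts for t in fact[1:]})
    variables = {t for atom in leaves for t in atom[1:] if t[:1].isupper()}
    free = sorted(variables - set(fixed))
    best = None
    for values in product(constants, repeat=len(free)):
        binding = dict(fixed)
        binding.update(zip(free, values))
        ground = [instantiate(atom, binding) for atom in leaves]
        proved = [g for g in ground if g in facts]
        missing = [g for g in ground if g not in facts]
        if best is None or len(proved) > len(best[0]):
            best = (proved, missing)
    return best

proved, missing = partial_proof(LEAVES, E1, {"X": "mary"})
print(proved)   # the two leaves established for Mary
print(missing)  # the single leaf that abduction has to supply
```

Under these assumptions the only missing leaf is shot-gun(obj1), which corresponds to the "?" in Figure 8.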

First induction (abduction of the missing part)

We attempt to complete Pp1 by viewing Pc1 as an instance of
Pc0; this is one of the possible definitions of "closeness".
Therefore, Pp1 will be completed by taking the missing pieces
from Pc0, appropriately instantiated.

In our example, such a forced matching leads to Pc1, in
which the missing part of Pp1 has been replaced by
shot-gun(obj1). In this first abduction, the cause of the
failure is attributed to our supposed "ignorance" that obj1
(i.e., sleeping-pills) is actually a shot-gun. This mechanism
has already been considered in other works about abduction,
such as Cox and Pietrzykowski (1986). Our example shows that
it can be quite a dangerous step to take, since it leads us
to complete the theory by adding the "fact"

shot-gun(obj1).

which amounts to stating that sleeping-pills are kinds of
shot-guns.

With a view to using this abduction in an analogical
process, it will be quite easy to check whether this
abduction allows us to complete the solution of the target
problem. If it does not, we propose to use another kind of
induction, in which Pc0 and Pc1 are not supposed to match.

Second induction (induction of the missing theorem)

In this case, we try to add a new rule that will allow us to
complete the proof. One easily understands why this
mechanism has not been taken into account so far: in
principle one can add so many ridiculous rules that this
approach seems to be hopeless.

For instance, in our example, adding to Th0 the rule

kill(X,X) :- sleeping-pills(C), price(C, 6)

will indeed allow us to prove Mary's suicide. But it means
that everyone will commit suicide when the price of sleeping
pills reaches the value 6, which is totally irrelevant to the
preceding suicide case.

In order to avoid adding such ridiculous rules, we define a
new notion of distance between Pc0 and Pc1. Given two
possible completed proof trees, Pc1 and Pc1', we shall
collect the mismatches between Pc0 and Pc1, on the one
hand, and between Pc0 and Pc1' on the other hand. We
shall say that Pc1 is closer to Pc0 than Pc1' if the number
of mismatches between Pc1 and Pc0 is less than the
number of mismatches between Pc1' and Pc0, with an
exception for zero mismatches, which would drive us back
to the first kind of induction. In other words, we consider
the least mismatch, a complete matching being already
covered by the first abduction. When the numbers are equal,
we shall say that Pc1 is closer to Pc0 than Pc1' when the
conceptual distance (supposedly defined) between the
mismatches is less for Pc1 than for Pc1'.
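The selection rule can be sketched as follows. The text does not say exactly how mismatches are counted, so this sketch makes an assumption: each proof tree is reduced to the multiset of its predicate symbols, and a mismatch is a symbol present in one tree but not in the other. The candidate names and the function names are illustrative only, and the tie-breaking "conceptual distance" is left out.

```python
from collections import Counter

def mismatches(reference, candidate):
    """Count nodes present in one tree but not in the other (assumed measure)."""
    ref, cand = Counter(reference), Counter(candidate)
    return sum(((ref - cand) + (cand - ref)).values())

def closest(reference, candidates):
    """Pick the candidate with the fewest mismatches, ignoring exact matches,
    which are already handled by the first kind of induction."""
    scored = {name: mismatches(reference, nodes) for name, nodes in candidates.items()}
    nonzero = {name: k for name, k in scored.items() if k > 0}
    best = min(nonzero, key=nonzero.get) if nonzero else None
    return best, scored

# Predicate symbols of the nodes of Pc0 and of two candidate completions.
PC0 = ["kill", "hate", "possess", "shot-gun", "depressed", "buy"]
CANDIDATES = {
    "price-rule": ["kill", "sleeping-pills", "price"],
    "sleeping-pills-leaf": ["kill", "hate", "possess", "sleeping-pills", "depressed", "buy"],
}

best, scores = closest(PC0, CANDIDATES)
print(best, scores)
```

With this measure the tree built from the price rule differs from Pc0 in seven nodes, while the tree that only swaps shot-gun for sleeping-pills differs in two, so the latter is preferred.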

In our example, it is clear that the number of mismatches
between Pc0 and the proof tree obtained by using

kill(X,X) :- sleeping-pills(C), price(C, 6)

to prove Mary's suicide is very high.

We shall rather try to use our knowledge about the objects
possessed by Mary to complete Pp1. For instance,
completing it by sleeping-pills(obj1) gives a proof tree
Pc2.

kill(mary,mary)

hate(mary,mary)    possess(mary,obj1)    sleeping-pills(obj1)
     |                  |
depressed(mary)    buy(mary,obj1)

Figure 9. Pc2. Sleeping-pills are viewed as the cause of
Mary's death.

We can generalize this explanation by inversion of
resolution. The two clauses

kill(X,X) :- depressed(X), buy(X,C), shot-gun(C)

kill(mary,mary) :- depressed(mary), buy(mary,obj1),
                   sleeping-pills(obj1)

are generalized by intraconstruction and lead to the three
clauses

kill(X,X) :- depressed(X), buy(X,C), newp(C)

newp(C) :- shot-gun(C)

newp(C) :- sleeping-pills(C)
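The intraconstruction step above can be illustrated on this one example. The sketch below is a simplification under stated assumptions: the ground clause is first lifted with an inverse substitution that is given by hand (mary to X, obj1 to C), the two clauses must then have identical heads, and the new predicate takes as arguments the variables of the differing literals. It is not the general operator of Muggleton and Buntine (1988); `lift` and `intraconstruction` are illustrative names.

```python
def lift(clause, inverse_subst):
    """Replace constants by variables according to a given inverse substitution."""
    def ren(atom):
        return (atom[0],) + tuple(inverse_subst.get(t, t) for t in atom[1:])
    head, body = clause
    return ren(head), [ren(b) for b in body]

def intraconstruction(c1, c2, new_name="newp"):
    """Factor the common part of two clauses and name the differing part."""
    (h1, b1), (h2, b2) = c1, c2
    if h1 != h2:
        raise ValueError("heads must be identical after lifting")
    common = [a for a in b1 if a in b2]
    rest1 = [a for a in b1 if a not in common]
    rest2 = [a for a in b2 if a not in common]
    # Arguments of the new predicate: variables of the differing literals.
    args = sorted({t for a in rest1 + rest2 for t in a[1:] if t[:1].isupper()})
    new_atom = (new_name,) + tuple(args)
    return [(h1, common + [new_atom]), (new_atom, rest1), (new_atom, rest2)]

general = (("kill", "X", "X"),
           [("depressed", "X"), ("buy", "X", "C"), ("shot-gun", "C")])
mary = (("kill", "mary", "mary"),
        [("depressed", "mary"), ("buy", "mary", "obj1"), ("sleeping-pills", "obj1")])

clauses = intraconstruction(general, lift(mary, {"mary": "X", "obj1": "C"}))
for head, body in clauses:
    print(head, ":-", body)
```

On this input the sketch returns the three clauses listed above, with newp(C) defined once by shot-gun(C) and once by sleeping-pills(C).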

In this example, the values of newp(C) are a description in
extension of the concept of "tool-for-suicide" which,
actually, is a poorly defined concept. Besides guns and
sleeping-pills, it also covers various cliffs, the Eiffel
tower, etc. Had we not been driven by the first example,
this abductive recovery would have been done with little
caution. In other words, this kind of abductive recovery is
hard to perform, but we claim that it is quite necessary,
and that it finds a justification within our frame.

5  Recovering  from  plan  failures 


Suppose that the plan p of the base problem amounts to
the application of a sequence of operators (Op1, ...,
Opn). The means of the base problem contain a set of
instantiations, called here σ, such that σ applied to (Op1,
..., Opn) leads to fulfilling the goals of the base problem.
In other words, one has to prove that σ(Op1, ..., Opn) does
not contradict the goals of the base, which amounts to
proving that each Opi is such that σOpi does not contradict
the goals of the base, and that the post-conditions of
σOpi contain the pre-conditions of σOpi+1.

The means of the target problem contain a set of
instantiations, called here σ'. We propose to "compute"
σ'(Op1, ..., Opn) and to attempt to prove that it does not
contradict the goals of the target problem. Unless we are
extremely lucky, this proof will fail. We then propose to
apply the above abductive recovery techniques in order to
generate a new sequence of operators (Op'1, ..., Op'p)
such that σ'(Op'1, ..., Op'p) does not contradict the goals
of the target problem.

Section 7 gives a detailed example of how to achieve
such a recovery. In a few words, the general strategy we
use is the following:

- recognize the parts of the proof that have succeeded.
If no success at all occurs^, we fail to recover.

- delete the operators that have led to a failure. This
process introduces unknown values (i.e., variables) in the
sub-sequence of (Op1, ..., Opn) which is left. Let
(Op"1, ..., Op"q) be this sub-sequence.

- prove that some instantiations σ" coming from the
knowledge base can ensure that (Op"1, ..., Op"q), together
with these instantiations, does not contradict the goals of
the target.

- verify that σ" does not contradict σ'.

- perform the union of the old σ' and of σ". This union,
together with (Op"1, ..., Op"q), is a complete solution to
the target problem.

6  Application  to  Analogy 

In (Kodratoff, 1990), we analyze the ways similarity and
causality can combine, and we define the concept of full
analogies. We say that an analogy is full when the law
of combination of causality and similarity is σ ∘ β ∘ σ⁻¹.
The symbol ∘ represents the composition of the
substitutions, i.e., the application of σ⁻¹ to A', then the
application of β to the result of the last operation, and
finally the application of σ to this last result (many
examples are given below). In this definition, we assume
that σ = σ'. Let us now explain why this is not a real
restriction to our scheme.

^ This is the case when none of the subproblems can be
solved. For instance, in the case of Mary's suicide (section
4), suppose we are also unable to prove depressed(mary) and
buy(mary,obj1). Then our attempt to prove kill(mary,mary)
is a complete failure during which we met no success at all.

In our scheme, σ is a set of replacements expressing how
to transform A into A', and σ' is a set of replacements
expressing how to transform B into B'. It may well happen
that σ and σ' do not concern the same set of replacements.
When this is the case, we shall always consider that
"σ = σ'" is the global similarity, σ ∪ σ', between the
whole base and the whole target. This hypothesis would
restrict the generality of our scheme only if the
replacements could be contradictory in some sense. In that
case, the analogy would have to take into account some kind
of contradiction within the base itself. The first work to
be done when setting up a knowledge representation would
then be to make this contradiction explicit and, by that,
to get rid of it^.

This definition can be represented as follows, together
with our example.

BASE
  A  (means of source problem): abduction of a child by a
     kidnapper; parents of child are rich
  β  (known plan): plan for becoming rich by abducting a child
  B  (goal of source problem): kidnapper gets money

TARGET
  A' (means of target problem): abduction of a politician by
     a terrorist; politician is famous
  β' = σ ∘ β ∘ σ⁻¹ (looked-for plan)
  B' (goal of target problem): terrorist gets advertisement

Similarity σ = σ' between base and target:
  (abduction <- abduction, kidnapper <- politician,
  child's parents wealth <- politician's fame, get <- get,
  money <- advertisement)

Figure 10. Applying our analogy scheme to the comparison
of two kidnapping cases. The base case is the one of
a child's kidnapping, the target case is the abduction of a
famous politician. The difficult substitution [child's
parents wealth <- politician's fame] does not need to be
found in advance to be able to apply analogy. Notice that
the similarity is σ ∪ σ'; it contains the substitutions
allowing to go from one story to the other.

^ This point could be made more precise, and be made quite
formal. This is not our goal in this paper, which is devoted
to a more intuitive presentation. Here, it should be clear
that as long as σ and σ' do not contradict each other, it
will be simple to find a similarity that includes the two of
them.
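The composition σ ∘ β ∘ σ⁻¹ can be made concrete on a toy encoding of the two kidnapping stories. This is only a sketch under several assumptions that are not in the paper: facts are tuples of tokens, σ is a one-to-one renaming stored as a dictionary from base tokens to target tokens, and the base causality β is hard-coded as a single rule. It shows the simple case in which the computation succeeds.

```python
def apply_subst(subst, facts):
    """Rename every token of every fact according to the substitution."""
    return {tuple(subst.get(token, token) for token in fact) for fact in facts}

def invert(subst):
    """Invert a one-to-one renaming; fail if two keys share a value."""
    inverse = {value: key for key, value in subst.items()}
    if len(inverse) != len(subst):
        raise ValueError("substitution is not invertible")
    return inverse

def beta(base_facts):
    """Base causality (assumed form): abducting a child whose parents are
    wealthy lets the kidnapper get money."""
    needed = {("abduction", "kidnapper", "child"),
              ("has", "child", "parents-wealth")}
    return {("gets", "kidnapper", "money")} if needed <= base_facts else set()

def full_analogy(sigma, causality, target_means):
    """Compute sigma o beta o sigma^-1 applied to the means of the target."""
    in_base_terms = apply_subst(invert(sigma), target_means)
    base_goal = causality(in_base_terms)
    return apply_subst(sigma, base_goal)

# Similarity, written here from base vocabulary to target vocabulary.
SIGMA = {
    "kidnapper": "terrorist",
    "child": "politician",
    "parents-wealth": "fame",
    "money": "advertisement",
}
TARGET_MEANS = {("abduction", "terrorist", "politician"),
                ("has", "politician", "fame")}

target_goal = full_analogy(SIGMA, beta, TARGET_MEANS)
print(target_goal)
```

When β is a set of detailed operators rather than a single rule, the same computation can fail, which is the situation handled by abductive recovery in the next section.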


7  A  detailed  example  of  the  generation 
of  a  new  plan  by  abductive  recovery  of 
the  failure  of  the  old  plan 


We  will  illustrate  this  view  of  analogy  by  using  the 
plans  that  have  been  learned  by  explanation-based  learn¬ 
ing  (EBL)  in  (DeJong  and  Mooney,  1986)  for  kidnapping 
a  rich  person's  child.  Our  aim  will  be  to  transform  by 
analogy  this  plan  in  order  to  apply  it  to  the  case  of  terror¬ 
ists  abducting  a  famous  politician  in  order  to  advertise 
their  political  cause. 

The  means  and  goals  of  the  two  problems  are  given  in 
the  scheme  of  figure  10  which  summarizes  the 
information  contained  in  the  two  problems. 

In their paper, DeJong and Mooney (1986) show that the
abduction of a child can be represented by the application
of the two operators of figure 11, with the instantiations
[person1 <- kidnapper, person2 <- child's parents, person3
<- child, advertisement(abstract) <-
advertisement(concrete)].

[Diagram of Operator 1 and Operator 2; surviving labels:
BARGAIN, person3.]

Figure 11. Generalized operators for kidnapping. The goal
of kidnapping succeeds when the application of OP1 and
OP2 succeeds. Each operation inside the operator can
also be a problem by itself. For instance, how person1
manages to hold person2 captive may be a problem by
itself.

These operators are the "causality" β we have been
introducing in our scheme. Actually, they do explain why
the kidnapper can achieve his goal of getting money. Let us
attempt a full analogy by computing β ∘ σ⁻¹. In this case,
we shall not use the similarities as shown in figure 10, but
the one which is relevant to the application of β, i.e.,
[kidnapper <- terrorist, ??? <- child's parents, politician
<- child, money(concrete) <- advertisement(concrete)],
where the ??? expresses the fact that we do not know yet
who is going to play the role of the child's parents. This
amounts to applying the instantiations [person1 <-
terrorist, person2 <- person2, person3 <- politician,
advertisement(abstract) <- advertisement(concrete)] to OP1
and OP2. Applying these substitutions, we find operators
that still contain the variable person2. Our problem is now
to use the background knowledge in order to find an
instance of person2 that will not introduce contradictions.
Let us suppose that our background knowledge is represented
by the following set of clauses.

1- WANTS(x, y, z) IF NEEDS(x, z)

2- LIKES-MORE-THAN(x, y, money) IF LOVES(x, y)

3- LIKES-MORE-THAN(x, y, money) IF PARENT(x, y) &
   NOT-EXCEPTION-PARENTAL-RELATION(x, y)

4- LIKES-MORE-THAN(x, y, money) IF
   STRONG-RELATION-BETWEEN(x, y)

5- EXCEPTION-PARENTAL-RELATION(x, y) IF

6- HOLDS-CAPTIVE(x, y) IF

7- GIVES-TO(media, x, advertisement) IF
   GIVES-TO(x, media, money)

8- GIVES-TO(media, x, advertisement) IF
   GIVES-TO(x, media, exciting-news)

9- GIVES-TO(x, y, exciting-news) IF EVENT(z) &
   KNOWS-OF(x, z) & RELATIVE-TO(z, t) & FAMOUS(t)

10- MEDIA(x) IF TV(x)

11- MEDIA(x) IF RADIO(x)

12- MEDIA(x) IF NEWSPAPER(x)

Asking this knowledge base the questions

? :- POSSESSES(person2, advertisement)

? :- LIKES-MORE-THAN(person2, politician, advertisement)

leads to a failure. It follows that we will be unable to
apply OP1.

Conversely, asking this knowledge base the question

? :- GIVES-TO(person2, x, advertisement)

gives two possible answers by using either clause 7 or
clause 8.

Answer1:  GIVES-TO(x, media, money).
Answer1': GIVES-TO(x, media, exciting-news).
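The three queries can be replayed with one-step matching of a query against clause heads. The following is a sketch under stated assumptions: only clauses 2, 3, 4, 7 and 8 are encoded, `media` is treated as a constant, the names in `VARIABLES` are the only variables, and `match` and `ask` are hypothetical helpers. The matcher does not check repeated variables inside a clause head, which none of these heads contain.

```python
VARIABLES = {"x", "y", "z", "t", "person2"}

KB = [
    (("LIKES-MORE-THAN", "x", "y", "money"), [("LOVES", "x", "y")]),               # clause 2
    (("LIKES-MORE-THAN", "x", "y", "money"),
     [("PARENT", "x", "y"), ("NOT-EXCEPTION-PARENTAL-RELATION", "x", "y")]),       # clause 3
    (("LIKES-MORE-THAN", "x", "y", "money"),
     [("STRONG-RELATION-BETWEEN", "x", "y")]),                                     # clause 4
    (("GIVES-TO", "media", "x", "advertisement"),
     [("GIVES-TO", "x", "media", "money")]),                                       # clause 7
    (("GIVES-TO", "media", "x", "advertisement"),
     [("GIVES-TO", "x", "media", "exciting-news")]),                               # clause 8
]

def match(query, head):
    """Match a query atom against a clause head; return bindings or None."""
    if query[0] != head[0] or len(query) != len(head):
        return None
    bindings = {}
    for q, h in zip(query[1:], head[1:]):
        if q in VARIABLES:
            if bindings.setdefault(q, h) != h:
                return None          # the query variable is already bound elsewhere
        elif h in VARIABLES:
            continue                 # the clause variable takes the query constant
        elif q != h:
            return None              # two different constants
    return bindings

def ask(query):
    """Return (bindings, body) for every clause whose head matches the query."""
    answers = []
    for head, body in KB:
        bindings = match(query, head)
        if bindings is not None:
            answers.append((bindings, body))
    return answers

failed_1 = ask(("POSSESSES", "person2", "advertisement"))
failed_2 = ask(("LIKES-MORE-THAN", "person2", "politician", "advertisement"))
answers = ask(("GIVES-TO", "person2", "x", "advertisement"))
print([bindings["person2"] for bindings, _ in answers])  # ['media', 'media']
```

Both successful answers bind person2 to media, while the two queries needed by OP1 return no answer.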


One will notice that Answer1' itself can lead to deeper
problems by using clause 9. This is not our point here; we
just want to find possible instantiations for OP2, it being
understood that the application of OP2 can lead to new
problems. Answer1' is such a case, and clause 9 is here to
show how the next problems can be solved, but is not
relevant to our present problem of analogy. Since we found
the two answers above, and since in both cases the variable
person2 was instantiated by media, we can claim that we are
able to apply OP2 to the terrorist case, with the
substitutions [person1 <- terrorist, person2 <- media,
person3 <- politician, advertisement(abstract) <-
advertisement(concrete)].

Since we have already been unable to answer the questions
?POSSESSES(person2, advertisement) and
?LIKES-MORE-THAN(person2, politician, advertisement),
it would be useless to attempt to ask them again by
instantiating person2 by media. Our solution is then to
delete from OP1 the links we have been unable to prove,
to replace person2 by media, and to replace the links
between the characters in OP1 by those found as an answer
to the application of OP2. It follows that, in our example,
the scheme found by analogy will be: apply either OP1 or
OP'1, as shown by figure 12 below, and then apply OP2
with the instantiations [person1 <- terrorist, person2 <-
media, person3 <- politician, advertisement(abstract) <-
advertisement(concrete)].


[Diagram of Operator 1'; surviving labels: terrorist,
HOLDS-CAPTIVE, media, money, politician.]

Figure 12. New operators obtained by deleting from OP1
the failures, and adding the success coming from OP2.

8  Conclusion 

We have been proposing to use σ ∘ β ∘ σ⁻¹ as a combination
of the causality β and of the similarity σ in order to
compute the causality β' which, in turn, allows us to
compute the missing part of the target. In simple cases,
once the knowledge is properly represented, this
computation is straightforward. On the contrary, when
dealing with analogous solutions to similar problems, it
might well happen that the computation of σ ∘ β ∘ σ⁻¹
fails, as exemplified in section 6. In such a case, we have
to recover from the failure. The solution we suggest is to
use the recently established techniques of abductive
recovery (or, alternately, of inversion of resolution) in
order to recover and find another plan able to solve the
target problem. Abductive recovery is not a process easy to
implement, even though its basic mechanism, inversion of
resolution, has already been implemented twice (Muggleton
and Buntine, 1988; Rouveirol and Puget, 1989). The
implementation of the recovery process described in (Duval
and Kodratoff, 1989) is presently under way, and will be
used in order to recover from failures to perform an
analogy.

In conclusion, we do believe that very deep analogies
similar to the ones performed by humans are very far
from the present state-of-the-art of artificial
intelligence. On the contrary, simple and robust analogies
performed by computing σ ∘ β ∘ σ⁻¹ are easy to implement,
and, when the computation succeeds, they should very soon
become a standard mechanism for performing simple changes
allowing an adaptive behavior in vision and robotic
systems.

References 

Carbonell,  J.G.  Derivational  Analogy:  A  Theory  of 
Reconstructive  Problem  Solving  and  Expertise 
Acquisition,  in  R.S.  Michalski,  J.  G.  Carbonell,  T.  M. 
Mitchell  (Eds.),  Machine  Learning:  An  Artificial 
Intelligence  Approach,  Volume  II,  Morgan  Kaufmann 
1986,  pp.  371-392. 

Carbonell, J.G. Learning by Analogy: Formulating
and Generalizing Plans from Past Experience, in R.S.
Michalski, J. G. Carbonell, T. M. Mitchell (Eds.),
Machine Learning: An Artificial Intelligence Approach,
Morgan Kaufmann 1983, pp. 137-159.

Chouraqui, E. Construction of a Model for Reasoning
by Analogy, in Progress in Artificial Intelligence, L.
Steels (Ed.), Halsted Press, New York, pp. 169-183,
1985.

Cox, P.T., Pietrzykowski, T. Causes for Events:
Their Computation and Applications, Proceedings of the
Eighth International Conference on Automated Deduction,
Oxford, 1986, Lecture Notes in Computer Science n°
230, Springer Verlag, Berlin, pp. 608-621.

DeJong G.F. & Mooney R.J. Explanation-Based
Learning: An Alternative View, Machine Learning 1, 2,
pp. 145-176, 1986.

Duval, B., Kodratoff, Y. A Tool for the Management
of Incomplete Theories: Reasoning about explanations, in
Machine Learning, Meta-Reasoning and Logics, P.
Brazdil and K. Konolige (Eds), Kluwer Academic Press
pp. 135-158, 1989.

Falkenhainer, B., Forbus, K. D., Gentner, D. The
Structure-Mapping Engine, Report N° UIUCDCS-R-86-1275,
DCS, Univ. of Illinois at Urbana-Champaign, May
1986. See also Proc. AAAI-86.

Gentner, D., Analogical Inference and Analogical
Access, in Analogica, Prieditis A. (Ed), Pitman, London,
1988, pp. 63-88.

Gentner, D., Structure-Mapping: A theoretical
Framework for Analogy, Cognitive Science 7, pp. 155-170,
1983.

Kedar-Cabelli, S., Toward a Computational Model of
Purpose-directed Analogy, in Analogica, Prieditis A.
(Ed), Pitman, London, 1988, pp. 89-107.

Kodratoff, Y., Introduction to Machine
Learning, Pitman, London, 1988.

Kodratoff, Y. Combining Similarity and Causality in
Creative Analogy, Research Report 537, LRI, Jan. 1990.

Muggleton, S., Buntine, R. Machine invention of
first order predicates by inverting resolution, Proceedings
of 5th International Machine Learning Workshop, pp.
339-352, Morgan Kaufmann, 1988.

Rouveirol, C., Puget J.F. A simple solution for
Inverting Resolution, Proceedings of the fourth European
Working Session on Learning, pp. 201-211, Pitman,
1989.

Rouveirol, C., Puget J. F. A Simple Solution for
Inverting Resolution, in Proc. 4th EWSL, Morik K.
(Ed), Pitman, London 1989, pp. 201-210.

Russell, S. J., The Use of Knowledge in Analogy and
Induction, Pitman, London, 1989.

Winston, P. H. Learning new Principles from
Precedents and Exercises, Artificial Intelligence 19, 1982,
pp. 321-350.




Issues  in  the  Design  of  Operator  Composition  Systems 


Steven  Minton 
Sterling  Federal  Systems 
AI  Research  Branch,  Mail  Stop:  244-17 
NASA  Ames  Research  Center 
Moffett  Field,  CA  94035  U.S.A. 


Abstract 

Many  learning  problem  solvers  operate  by 
composing  operator  sequences,  so  that  a 
learned  sequence  can  be  applied  as  a  unit 
during  subsequent  problem  solving.  In  this 
paper  we  will  describe  an  abstract  model  of 
the  operator  composition  and  problem  solv¬ 
ing  processes,  and  use  the  model  to  analyze 
several  design  issues  that  affect  the  utility 
of  the  learning  method.  We  will  focus  pri¬ 
marily  on  design  issues  that  arose  during  the 
implementation of the PRODIGY system[16; 18] and two of
its predecessors[17; 15]. The purpose of this paper is to
consider these issues from the common perspective offered
by our model, and to summarize the relevant research in the
field.

1  Introduction 

Many learning problem solvers employ the same general
approach to learning from experience. They learn
by composing rule sequences[2; 11], so that the sequences
can be employed as single units during subsequent problem
solving. Depending on the particular system, the composed
sequence may be referred to as a macro-operator[5], a
chunk[10], a heuristic[19], a search control rule[16],
etc. Most of the literature
in  the  field  is  concerned  with  how  these  composed  se¬ 
quences  are  learned.  However,  there  are  also  impor¬ 
tant  design  issues  that  arise  when  one  considers  how 
the  composed  sequences  should  be  stored  and  used 
during  subsequent  problem  solving.  Most  of  these  is¬ 
sues  have  received  little  or  no  attention,  though  in  fact 
they  do  influence  the  utility  of  the  learning  method  and 
the  circumstances  under  which  the  method  is  appro¬ 
priate. 

In  this  paper  we  suggest  that  operator  composition 
has  three  primary  effects  on  problem  solving.  Within 
the  context  of  this  model,  we  discuss  how  a  variety 
of  alternative  methods  for  using  composed  sequences 
affect the problem-solving process. We focus primarily
on design issues that arose while implementing
three systems: PRODIGY[16; 18], MORRIS[17] and
the CBG learning game-player[15]. The purpose of


this  paper  is  to  consider  these  issues  from  the  com¬ 
mon  perspective  offered  by  our  model,  and  to  summa¬ 
rize  the  relevant  research  in  the  field,  rather  than  to 
investigate  any  single  approach  or  issue  in  depth. 

2  Assumptions  and  Terminology 

We  can  categorize  problem-solving  systems  according 
to  various  criteria.  The  first  is  the  domain  specifica¬ 
tion  language.  Some  systems  use  inference  rules  as 
their  basic  unit,  as  do  theorem  provers,  in  which  case 
the learned rules can be stated as lemmas. Others use
operators, as in STRIPS[5] or PRODIGY. For the purposes
of this paper, we will assume only that the rules
or operators have explicit preconditions and
postconditions, and that they are composable. We will
generically refer to these basic building blocks as
primitive operators, and the structures composed from them
as macros. We will restrict our attention to macros
that represent simple operator sequences, and will
not consider disjunctive or iterative macros[22]. The
preconditions and postconditions of a learned macro
are assumed to be exactly the weakest preconditions
and postconditions of the composed sequence of
operators[16].*

For illustrative purposes, the primitive operators we
will use in our examples will be simple STRIPS operators
with conjunctive preconditions with variables, as
these are very commonly employed. (Some systems,
such as PRODIGY, use more expressive precondition
languages, and where this is relevant we will note this
explicitly.) Our examples will be taken from robot
problem-solving domains, similar to the STRIPS domain,
that involve a single robot moving from room to
room and accomplishing simple tasks. Since the operators
and macros have variables, they actually represent
operator schemas, and we will assume the variables
must be bound before the operator can be applied.
We will refer to an instance of an operator with all of
its variables bound to constants as an instantiation of
that operator. In our figures, variables will be indicated
in italics, and constants in uppercase.

* Often the primitive operators in a sequence can be
composed in different ways, depending on how the
preconditions and postconditions of the operators unify with
each other. Thus several different macros may actually be
composed from a given operator sequence.

A  second  criterion  for  categorizing  problem  solvers 
is  the  search  method.  Most  systems  use  some  variation 
of  either  forward  search  (as  in  production  systems)  or 
backward  search  (as  in  means-end  analysis),  with  ei¬ 
ther  a  depth-first,  breadth-first  or  best-first  ordering 
strategy.  In  general,  the  issues  we  will  discuss  arise 
regardless  of  the  search  method,  although  there  are 
some  peculiarities  that  are  method-specific,  in  which 
case  these  will  be  explicitly  mentioned.  We  assume 
only  that  the  problem  solver  operates  by  successively 
expanding  the  nodes  of  a  search  tree,  so  that  at  each 
node  of  the  tree  some  set  of  primitive  operators  and 
macros  is  considered.  Each  edge  in  the  tree  thus  cor¬ 
responds  to  some  operator  or  macro.  Each  node  in  the 
tree  is  a  state  in  the  search  space. 

3  The  Effects  of  Macro  Learning 

The  purpose  of  macro  learning  is  generally  to  speed 
up  problem  solving.  For  example,  in  [16],  the  utility 
of  a  learned  macro  is  measured  in  terms  of  its  expected 
effect  on  problem-solving  performance,  as  given  by  the 
following: 

Utility = (AvrSavings × ApplicFreq) - AvrMatchCost

AvrMatchCost  is  the  expected  time  cost  of  determin¬ 
ing  whether  the  macro  is  applicable,  ApplicFreq  is  the 
probability  that  the  macro  will  be  applicable  when  it 
is  tested,  and  AvrSavings  is  the  average  time  differ¬ 
ence  in  problem-solving  performance  one  can  expect  if 
the  macro  is  applied,  as  compared  to  that  if  it  is  not 
applied.  Note  that  the  average  savings  can  be  nega¬ 
tive,  since  after  applying  a  macro  the  system  might 
actually  be  farther  from  a  solution  than  before.  In 
general,  the  savings,  match  cost,  and  application  fre¬ 
quency  may  depend  on  complex  factors,  such  as  the 
number  of  other  learned  macros  in  the  system  and  the 
matching  method.  Thus  the  formula  above  gives  the 
utility  of  an  individual  macro,  but  offers  little  insight 
into  how  well  specific  strategies  for  learning  and  using 
macros  will  perform.  For  this,  one  must  consider  the 
overall  effect  of  macro  learning  on  the  search  space. 
In  this  section,  we  suggest  that  macro  learning  can  be 
viewed  as  having  three  primary  effects. 
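The utility formula above is easy to evaluate once the three quantities have been estimated. The sketch below only restates the formula; the numbers are invented for illustration and do not come from any experiment reported here.

```python
def macro_utility(avr_savings, applic_freq, avr_match_cost):
    """Utility = (AvrSavings x ApplicFreq) - AvrMatchCost."""
    return avr_savings * applic_freq - avr_match_cost

# Hypothetical macro: saves 10 time units when it applies, applies in 20%
# of the tests, and costs 3 time units to match on every test.
rarely_useful = macro_utility(avr_savings=10.0, applic_freq=0.2, avr_match_cost=3.0)

# Same savings and match cost, but the macro applies in 50% of the tests.
often_useful = macro_utility(avr_savings=10.0, applic_freq=0.5, avr_match_cost=3.0)

print(rarely_useful < 0 < often_useful)  # True
```

The first macro has negative utility even though it saves time whenever it applies, because its match cost is paid on every test.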

First, macro learning changes the order in which
the search space is traversed. Typically, operator
sequences that have previously succeeded (i.e., those
encoded as macros) are tried before other sequences. We
refer to this as the reordering effect (or "experiential
bias" [17]).^ Figure 1 illustrates the reordering effect
by showing the search tree for a simple depth-first,
forward-chaining problem solver before and after
learning. The search space contains five primitive
operators, O1, O2, O3, O4 and O5, and one learned
macro, M-2-3, created from the sequence O2, O3. (For
simplicity, the figure only shows the top levels of the
search tree.) During problem solving the system always
tries applying the macro before any of the original
primitive operators. In the figure, notice that State5
is the fifth node visited before learning, but it is
visited first after learning.

Figure 1: Illustration of the reordering effect for a very
simple search space

^ In some systems, search is conducted exclusively with
macros after a certain point, as in Korf's system[9], where
the macros are guaranteed to find a solution. In any event,
since we could theoretically search with the original system
if the macros fail to find a solution, we can consider this
as an extreme case of reordering.

Another  effect  is  a  change  in  path  cost:  the  cost  of 
reaching  a  state  via  a  macro  may  be  more  or  less  than 
the  cost  of  testing  and  applying  the  corresponding  se¬ 
quence  of  primitive  operators.  Typically,  the  cost  of 
using a macro is less than the cost of using the
corresponding sequence of primitive operators. One reason
is that there are often fewer preconditions and/or
postconditions in the macro than in the corresponding
sequence of primitive operators. For example, consider
a macro composed from the operator sequence
GO-TO-OBJ, PICK-UP-OBJ. Most of the preconditions
of the primitive operators are also preconditions
of the macro. However, the precondition (NEXT-TO
ROBOT object) of PICK-UP-OBJ, for example, is not a
precondition of the macro because it is added by the
GO-TO-OBJ operator.
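The way a precondition disappears from a macro can be shown with ground STRIPS-style operators. This is a minimal sketch, assuming operators are given as precondition, add and delete sets of ground literals; apart from (NEXT-TO ROBOT object), the literals of GO-TO-OBJ and PICK-UP-OBJ used here are invented for illustration, and variables and unification are ignored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset

def compose(first, second):
    """Compose two ground operators into a macro for the sequence first, second."""
    if first.delete & second.pre:
        raise ValueError("first operator deletes a precondition of the second")
    return Operator(
        name=f"{first.name}+{second.name}",
        # Preconditions of the second operator that the first one adds
        # no longer have to hold before the macro is applied.
        pre=first.pre | (second.pre - first.add),
        add=(first.add - second.delete) | second.add,
        delete=(first.delete - second.add) | second.delete,
    )

GO_TO_OBJ = Operator(
    name="GO-TO-OBJ",
    pre=frozenset({"(IN-ROOM ROBOT ROOM1)", "(IN-ROOM BOX1 ROOM1)"}),
    add=frozenset({"(NEXT-TO ROBOT BOX1)"}),
    delete=frozenset(),
)
PICK_UP_OBJ = Operator(
    name="PICK-UP-OBJ",
    pre=frozenset({"(NEXT-TO ROBOT BOX1)", "(ARM-EMPTY)"}),
    add=frozenset({"(HOLDING BOX1)"}),
    delete=frozenset({"(ARM-EMPTY)"}),
)

macro = compose(GO_TO_OBJ, PICK_UP_OBJ)
print(sorted(macro.pre))
```

The macro keeps three of the four preconditions of its components; (NEXT-TO ROBOT BOX1) is dropped because GO-TO-OBJ establishes it.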

A third effect is that of increased redundancy, which
tends to degrade performance. There are two sources
of increased redundancy. First, after learning a set
of macros, there may be many paths that lead to
identical search states, since the same sequence
of primitive operators can be explored using various
combinations of macros and primitive operators.
Figure 1 illustrates this source of redundancy, which we
refer to as search state redundancy. If macro M-2-3
does not lead to a solution, the corresponding
sequence of primitive operators, O2, O3, will still be
explored later in the search, and thus state 5 (and
all of its successors) will be visited twice. Second,
the learned macros may have duplicate initial
subsequences. For example, a macro composed of primitive
operators O2 and O3 will share many of the same
preconditions as a macro composed of primitive operators
O2 and O4. Thus, the preconditions of a primitive
operator, such as O2, may be tested again and again
as each of the macros containing that operator is
tested. We refer to this as path redundancy. Figure
2 shows the transformation produced by macro
formation when all operator sequences are converted into
macros, illustrating the potential increase in path
redundancy. According to our model, increasing
redundancy is one reason why a macro-learning system may
perform worse than a non-learning system, as has been
observed in a variety of empirical studies (e.g., [17; 13;
20; 14]).

Figure 2: Complete conversion of search space into
macros illustrates potential increase in redundancy
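The reordering and search state redundancy effects can be sketched on a toy search space. The following is a minimal illustration, not the tree of Figure 1: it assumes a depth limit of two, identifies a state with the primitive-operator path that reaches it, and always tries the macro before the primitives. The function name `dfs_visits` and the depth limit are assumptions made for the example.

```python
from collections import Counter

PRIMITIVES = ["O1", "O2", "O3", "O4", "O5"]
MACROS = {"M-2-3": ("O2", "O3")}  # the macro expands to the sequence O2, O3


def dfs_visits(max_depth, use_macros):
    """Return states in the order a depth-first search visits them.

    A state is named by the tuple of primitive operators leading to it,
    so the same state reached via a macro or via primitives compares equal.
    """
    visits = []

    def expand(path):
        moves = []
        if use_macros:
            moves += list(MACROS.values())      # macros are tried first
        moves += [(op,) for op in PRIMITIVES]   # then the primitives
        for move in moves:
            child = path + move
            if len(child) > max_depth:          # toy depth limit (assumption)
                continue
            visits.append(child)
            expand(child)

    expand(())
    return visits


target = ("O2", "O3")
before = dfs_visits(2, use_macros=False)
after = dfs_visits(2, use_macros=True)
print("position before learning:", before.index(target))
print("position after learning: ", after.index(target))
print("visits after learning:   ", Counter(after)[target])
```

In this sketch the target state moves to the front of the visit order once the macro exists, but it is also reached a second time through the primitive route, which is the search state redundancy described above.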

These three factors often have conflicting effects on
the  performance  of  a  macro  system.  For  example,  our 
experience  with  the  MORRIS  system  [17],  a  STRIPS- 
like  macro-learning  system,  indicated  that  the  reorder¬ 
ing  effect  had  the  highest  potential  for  improving  per¬ 
formance,  but  that  increased  redundancy  could  easily 
counter  that  if  too  many  macros  were  learned.  Path 
cost  effects  were  much  less  significant  than  the  other 
two  effects. 

We also found that in designing macro-learning systems,
including PRODIGY, MORRIS and the CBG
learning game-playing system, many design decisions
had important ramifications with regard to the
interaction of these effects and the resulting performance of
the system. In the following sections, we will examine
each of these effects in more detail, and how the design
of a system influences the tradeoffs.

4  The  Reordering  Effect 

To  some  extent,  all  macro-learning  systems  employ 
learned  macros  in  preference  to  the  original  operators. 
Otherwise,  it  would  be  pointless  to  learn  macros.  Ex¬ 
actly  what  it  means  to  “employ”  a  macro  varies  con¬ 
siderably,  however.  In  this  section  we  consider  various 
schemes for using macros and their effect on the search
process. 

4.1  Intermediate  States 

The first reordering issue we will consider is whether
the macro is employed as an indivisible atomic unit,
or intermediate states are explored. Consider a simple
forward search system, where macro M-1-2-3 is
composed of individual operators O1, O2 and O3. As
shown in figure 3, if the system discovers that the
preconditions of macro M-1-2-3 are satisfied in the current
state, then it can apply that macro, and arrive in state
4. However, if the goal can actually be achieved by just
applying operators O1 and O2, then the system may
have “jumped over” the solution, since the goal may
not be true in state 4. (For a concrete example, consider
a macro for moving a robot from one room into
the next, and a problem where the goal is simply to
have the robot be at the doorway.) An alternative
option is, instead of treating the macro as a single atomic
operator, to treat the macro as a sequence of primitive
operators, so that intermediate states 2, 3 and 4
are visited in succession when the macro is applied. A
similar option is available to backward chaining systems.
When considering a macro that achieves some
goal, a system can successively backchain on each
operator in the sequence, rather than treating the macro
as an indivisible operator.
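The doorway example can be made concrete with a small sketch. It assumes a propositional STRIPS-style encoding in which each operator is a tuple of name, preconditions, add list and delete list; the operator and condition names are hypothetical and are not taken from any of the systems discussed.

```python
# Hypothetical operators: (name, preconditions, add list, delete list).
GO_TO_DOOR = ("GO-TO-DOOR", {"in-room-a"}, {"at-doorway"}, set())
GO_THROUGH = ("GO-THROUGH-DOOR", {"at-doorway"}, {"in-room-b"},
              {"in-room-a", "at-doorway"})
MACRO = [GO_TO_DOOR, GO_THROUGH]


def apply_op(state, op):
    """Apply one operator to a state (a set of true conditions)."""
    name, pre, add, delete = op
    if not pre <= state:
        raise ValueError(f"{name} is not applicable")
    return (state - delete) | add


def apply_atomically(state, macro):
    """Treat the macro as one indivisible step: only the final state is seen."""
    for op in macro:
        state = apply_op(state, op)
    return [state]


def apply_stepwise(state, macro):
    """Treat the macro as a sequence: every intermediate state is seen."""
    states = []
    for op in macro:
        state = apply_op(state, op)
        states.append(state)
    return states


def goal_reached(states, goal):
    """True if the goal conditions hold in any of the visited states."""
    return any(goal <= s for s in states)


start, goal = {"in-room-a"}, {"at-doorway"}
print("atomic:  ", goal_reached(apply_atomically(start, MACRO), goal))
print("stepwise:", goal_reached(apply_stepwise(start, MACRO), goal))
```

Under these assumptions the atomic application jumps over the doorway state, while the stepwise application passes through it and can notice that the goal already holds.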

This  strategy  may  seem  a  bit  odd  at  first,  but  in 
fact,  this  is  essentially  the  way  the  STRIPS  macro 
system  operated.  Given  a  macro  for  solving  a  par¬ 
ticular  goal,  STRIPS  would  scan  the  macro  to  find 
the  shortest  subsequence  whose  preconditions  were 
applicable,  and  thus  avoid  “jumping  over”  solutions. 
More specifically, given a macro representing operator
sequence $O_1, O_2, \ldots, O_n$, where $O_n$ achieves the current
goal, STRIPS would first check whether the preconditions
of $O_n$ matched, and if not, check the preconditions
of $O_{n-1}, O_n$, and then $O_{n-2}, O_{n-1}, O_n$, and so
on (using an efficient “planex scan” of the macro’s triangle
table) until the entire macro had been checked.
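The order in which tail subsequences are tried can be sketched as follows. This is a naive simulation under an assumed propositional encoding, not the triangle-table scan that STRIPS actually used; the operator names and conditions are hypothetical.

```python
# Hypothetical operators: (name, preconditions, add list, delete list).
GO_TO_DOOR = ("GO-TO-DOOR", {"in-room-1"}, {"at-door"}, set())
OPEN_DOOR = ("OPEN-DOOR", {"at-door", "door-closed"}, {"door-open"},
             {"door-closed"})
GO_THRU = ("GO-THRU-DOOR", {"at-door", "door-open"}, {"in-room-2"},
           {"in-room-1", "at-door"})
MACRO = [GO_TO_DOOR, OPEN_DOOR, GO_THRU]


def simulate(ops, state):
    """Run ops from state; return the final state, or None if one fails."""
    current = set(state)
    for _name, pre, add, delete in ops:
        if not pre <= current:
            return None
        current = (current - delete) | add
    return current


def shortest_applicable_tail(macro, state):
    """Try O_n, then O_{n-1} O_n, and so on; return the first tail that runs."""
    for start in range(len(macro) - 1, -1, -1):
        tail = macro[start:]
        if simulate(tail, state) is not None:
            return tail
    return None


at_open_door = {"in-room-1", "at-door", "door-open"}
far_from_door = {"in-room-1", "door-closed"}
print([op[0] for op in shortest_applicable_tail(MACRO, at_open_door)])
print([op[0] for op in shortest_applicable_tail(MACRO, far_from_door)])
```

When the robot already stands at an open door, only the last operator is selected; when it starts away from a closed door, the scan falls back to the whole macro.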

More sophisticated methods can be used to eliminate
unnecessary operators in a macro. Consider
a macro from the STRIPS robot-world in which the
robot moves from one room into the next by going to
the door, opening it, and moving through the doorway.
Given a problem where the door is already open,
a simplistic STRIPS-like macro system will backchain
on the preconditions of the macro, and have the robot
first close the door, so that the macro can then be
applied. A more sophisticated system that considers
each operator in turn can eliminate the OPEN-DOOR
operator from the plan since the door is already open.

Figure 3: Jumping over states 2 and 3 with macro
M-1-2-3

Unfortunately  there  are  generally  two  drawbacks  to 
checking  intermediate  states.  First,  there  is  little  point 
in  compiling  the  preconditions  and/or  postconditions 
of  the  macro,  since  the  individual  operators  must  each 
be  tested  and  applied  in  order  to  generate  the  interme¬ 
diate  states.  In  other  words,  macros  lose  their  ability 
to  decrease  path  cost.  Second,  redundancy  is  increased 
since  the  intermediate  states  may  be  visited  multiple 
times. They will be visited when any macros with
the same intermediate states are considered, as well as
when the original primitive operators are explored.

4.2  The  Single  Operator  Approach 

Another  common  approach  for  employing  macros  also 
avoids  the  problem  of  jumping  over  intermediate 
states. This strategy is used by the PRODIGY system
when it learns by observing successful operator
sequences [16; 18]. PRODIGY’s explanation-based
method^ for learning from success is similar to
macro-operator learning. (PRODIGY can also learn from
problem-solving failures and goal interactions; however,
we will focus only on learning from success, as
it is most similar to traditional macro-operator learning.)
To learn from a successful operator sequence,
the  system  analyses  the  operator  sequence  in  order 
to  identify  the  sequence’s  preconditions.  The  result¬ 
ing  rule,  called  a  search  control  rule,  is  very  similar 
to  a  traditional  macro-operator,  but  it  is  used  differ¬ 
ently.  If  a  control  rule  is  learned  from  an  operator 

^PRODIGY’s search control rules are learned via
explanation-based learning. The rules can be used to select,
reject, or indicate preferences for operators, goals or
bindings. In this paper we restrict the discussion to learning
from successful operator sequences, which produces
operator preference rules only. Although operator preference
rules are used differently than macros, the EBL
method is essentially identical to macro learning, and indeed,
PRODIGY’s EBL method for learning from successful
operator sequences has also been used to produce traditional
macros rather than search control rules [16].


sequence,  then  it  indicates  a  preference  for  the  last 
operator  in  the  sequence.  (The  last  operator  is  cho¬ 
sen  because  PRODIGY  employs  means-ends  analysis, 
a  form  of  backward  chaining).  For  example,  consider  a 
search  control  rule  learned  by  the  success  of  operator 
sequence O1, O2, O3 in solving a goal G. Although the
preconditions of the control rule are the preconditions
of the entire sequence, the control rule will recommend
only that O3 be chosen to solve goal G. The obvious
disadvantage of this strategy is that it can be expensive.
After O3 is chosen, O2 and O1 still remain to
be selected after the system backchains on O3. Thus
multiple  control  rules  are  needed  to  exactly  duplicate 
the  action  of  a  traditional  macro.  However,  the  the¬ 
ory  behind  this  strategy  assumes  that  at  most  deci¬ 
sion  points,  control  rules  are  unnecessary  -  the  correct 
choice  will  be  made  through  the  use  of  the  system’s  de¬ 
fault  heuristics.  Control  rules  are  only  needed  in  spe¬ 
cific  circumstances  where  the  wrong  operator  would  be 
chosen, and a costly mistake would be made.
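The way such a rule differs from a macro can be sketched in a few lines. This is only an illustration under assumed names: `PreferenceRule` and `choose_operator` are hypothetical and do not reproduce PRODIGY's rule language. The rule carries the preconditions of the whole sequence but recommends only the last operator, and a default choice is used when no rule fires.

```python
from typing import NamedTuple


class PreferenceRule(NamedTuple):
    goal: str              # goal the rule was learned for
    conditions: frozenset  # preconditions of the whole learned sequence
    preferred: str         # last operator of that sequence


def choose_operator(goal, state, candidates, rules):
    """Pick an operator for `goal`: a matching rule wins, else the default."""
    for rule in rules:
        if (rule.goal == goal
                and rule.conditions <= state
                and rule.preferred in candidates):
            return rule.preferred
    return candidates[0]  # stand-in for the system's default heuristics


# Rule learned from the success of O1, O2, O3 in solving goal G; the
# condition names p1 and p2 are placeholders for the sequence preconditions.
rule = PreferenceRule("G", frozenset({"p1", "p2"}), "O3")

print(choose_operator("G", {"p1", "p2", "p3"}, ["O4", "O3"], [rule]))
print(choose_operator("G", {"p1"}, ["O4", "O3"], [rule]))
```

After O3 is chosen the problem solver still has to select O2 and O1 by its ordinary means, which is why several rules may be needed to reproduce one macro.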

One  advantage  of  this  “single  operator”  strategy  is 
that  it  allows  control  information  to  be  brought  to  bear 
at  intermediate  states.  Consider  a  situation  where  two 
macros  exist,  MA  and  MB,  and  a  problem  that  in¬ 
volves  multiple  goals.  Assume  the  solution  requires 
interleaving  macros  MA  and  MB,  e.g.,  applying  an  ini¬ 
tial  subsequence  of  MA,  then  applying  MB,  and  finally 
finishing  MA.  If  the  system  could  only  apply  MA  as 
a  unit,  it  would  never  even  notice  that  MB  was  appli¬ 
cable  at  an  intermediate  state.  To  be  more  concrete, 
macro  MA  might  be  “driving  home  from  work”,  and 
macro  MB  might  be  “going  to  the  bank  to  get  money”. 
Given  the  two  goals  of  getting  home  and  having  money, 
one  would  want  to  consider  the  option  of  stopping  at 
the  bank  on  the  way  home  from  work.  Single  opera- 
tor  learning  enables  the  system  to  take  shorter  paths 
through  the  state  space  when  they  arise  serendipi- 
tously  because  the  problem  solver  still  plans  one  oper¬ 
ator  at  a  time.  In  the  traditional  macro  approach,  the 
problem  solver  is  locked  into  pre-established  sequences 
of  operators. 

A  second  advantage  of  the  single  operator  approach 
is that it enables multiple rules to be combined together
into simpler rules. For example, consider a system
that has learned two macros, one for going to the
car  and  opening  the  door  (assuming  it  is  unlocked)  and 
another  for  going  to  the  car,  unlocking  it,  and  opening 
it.  These  macros  can  be  combined  into  an  efficient  set 
of  single-operator  rules,  one  for  each  step  of  the  plan, 
rather  than  having  a  separate  macro  for  each  possible 
operator sequence. Several variations of the single operator
approach have been developed, such as LEX2’s
use of problem-solving heuristics [10], and PET’s use
of  episodes  [21]  (loosely  packaged  heuristic  rules  for 
operator  selection).  A  variety  of  methods  have  been 
suggested  for  combining  multiple  rules,  such  as  sim¬ 
plification  (in  PRODIGY)  and  induction  (in  LEX2). 
In-depth  comparisons  of  these  approaches  have  yet  to 
be  carried  out. 
4.3 Backchaining on Macros

Another  reordering  issue  (related  to  our  discussion  of 
intermediate  states)  arises  solely  in  connection  with 
backward  chaining  systems.  This  is  the  question  of 
whether  a  system  should  backchain  on  the  precondi¬ 
tions  of  macros  in  addition  to  backchaining  on  the  pre¬ 
conditions  of  primitive  operators.  In  other  words,  if 
there  exists  a  macro  that  can  achieve  a  current  goal 
(or  subgoal),  but  its  preconditions  are  not  currently 
satisfied,  should  a  system  backchain  on  the  unmatched 
preconditions  of  the  macro?  A  variety  of  answers  are 
possible. One strategy, used in PRODIGY,* is not
to backchain on macros at all; if the preconditions
of the macro are not satisfied, then the macro is
not used. A second, less restrictive strategy, used in
STRIPS, is to backchain on the unmatched preconditions
of the macro. A third strategy, even less restrictive,
is to backchain on the unmatched preconditions
of the “best” tail subsequence of a macro. (Given
a macro composed of operators $O_1, O_2, \ldots, O_n$, a tail
subsequence is any subsequence $O_k, O_{k+1}, \ldots, O_n$, where
$k < n$.) “Best” might be defined as the sequence with
the fewest unmatched preconditions, for instance.

Empirical evidence from a study by Mooney [20] indicates
that the strategy of backchaining on macros
may be counterproductive, in that it will decrease
utility. Mooney’s study compared a system that
backchained on macros to an extremely strict system
that applied a macro only if it could solve the
entire problem immediately (from the initial state).
The second system tended to perform better.
Unfortunately, Mooney’s study did not consider
intermediate points in the design space (such as a system
that  could  apply  macros  to  solve  subgoals,  but  which 
would  not  backchain  on  the  macros)  and  so  the  is¬ 
sue  of  backchaining  on  macros  still  has  not  been  com¬ 
pletely  addressed.  Experiments  (unpublished)  with 
the  MORRIS  system  indicated  that  backchaining  on 
macros  can,  by  itself,  reduce  utility,  which  is  consis¬ 
tent  with  Mooney’s  results. 

Using our search space model it is straightforward
to explain why backchaining on macros may be
counterproductive. Consider a node in the search tree,
as shown in figure 4. There will be a set of relevant
macros at that node. (In a backward-chaining system,
a rule or macro is relevant at a node if one or more of its
postconditions unifies with one or more goals at that
node.) Macros formed from similar subsequences, such
as M-1-2-3 and M-2-3-4 in the figure, will contain many
identical preconditions, such as P3. Each precondition
is a potential subgoal. For each subgoal, there will be
a subtree (in the search tree) that must be explored in
order to achieve that subgoal (or determine that it is
unachievable). Thus, for each identical subgoal, there
will be identical subtrees. In the figure, precondition
P3 is an unmatched precondition of both macros, and
therefore becomes a subgoal for both macros. Thus,
this situation is one version of the redundancy problem
described in section 3, since the same work will be
repeated in different paths in the search space.

*PRODIGY can learn both traditional macros and search
control rules, neither of which is backchained on.

Figure 4: How backchaining on macros leads to
redundancy
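The duplication of subgoals can be counted directly. In this small sketch the precondition sets are invented for illustration; only the shared precondition P3 and the two macro names come from the discussion above.

```python
from collections import Counter

# Hypothetical precondition sets; only P3 is named in the text.
MACRO_PRECONDITIONS = {
    "M-1-2-3": {"P1", "P3", "P5"},
    "M-2-3-4": {"P2", "P3", "P6"},
}


def subgoals_generated(state, macros):
    """Count how often each unmatched precondition becomes a subgoal
    when the system backchains on every relevant macro."""
    counts = Counter()
    for preconditions in macros.values():
        counts.update(preconditions - state)  # unmatched ones only
    return counts


counts = subgoals_generated({"P1", "P2"}, MACRO_PRECONDITIONS)
repeated = sorted(goal for goal, n in counts.items() if n > 1)
print("subgoals pursued more than once:", repeated)
```

Every subgoal that is counted more than once stands for a subtree that a backchaining system would explore once per macro.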

4.4  Interactions  Between  Search  Strategy 
and  Macro  Usage 

The  last  ordering  issue  we  will  consider  in  this  section 
concerns  the  interaction  of  macro  learning  and  depth- 
first  search.  For  efficiency,  many  problem  solvers  use 
some  variation  of  depth-first  search.  Unfortunately, 
macro  learning  can  give  a  depth-first  search  a  more 
breadth-first  flavor,  as  the  following  example  illus¬ 
trates  [16;  24].  Figure  5  shows  a  graph  representing 
seven  connected  rooms  in  a  house.  Consider  a  sim¬ 
ple  operator  for  moving  a  robot  from  room  to  room: 
to  move  from  room-x  to  room-y,  the  rooms  must  be 
directly connected (i.e., there must be an edge connecting
them on the graph) and the robot must be
in  room-x.  Given  a  problem  where  the  robot  must 
move  to  room  4  from  room  1,  a  solution  is  to  move 
from  room  1,  through  rooms  2  and  3,  and  into  room 
4.  A  depth-first  traversal  of  the  search  space  will,  in 
effect,  explore  the  graph  depth-first,  which  in  this  case 
is  a  very  efficient  way  to  find  a  path.  The  macro  that 
is  learned  from  this  solution  is  shown  in  figure  5.  It 
states that if there is a sequence of four connected
rooms, then the robot can move from the first room
to the last. Unfortunately, when using such a macro,
most problem solvers will find all paths between rooms
1  and  4,  which  is  considerably  more  expensive  than 
finding  a  single  path.  This  difficulty  is  due  to  the  use 
of  matching  algorithms  such  as  RETE  [6],  which  at¬ 
tempt  to  find  all  matches  to  the  preconditions  of  a 
macro.  Thus,  what  was  originally  an  efficient  depth- 
first  search  is  converted  into  a  search  where  all  paths 
up  to  length  four  will  be  explored  (by  the  matcher). 

One might argue that this problem can be solved
simply by using a matcher that stops once a single
successful instantiation of a macro is found.
Unfortunately, the answer is not so simple. In a STRIPS-like
macro system, for instance, a macro will rarely solve
an entire problem, but will only solve one or more
subgoals. Thus, the first instantiation found may solve a
subgoal, but after applying the instantiated macro, it
may not be possible to find a global solution.
Alternative instantiations must then be explored. Consider
a slightly more complex domain in which some of the
rooms have wet floors. In this domain the move
operator is augmented so that moving into a room with
wet floors will make the robot’s wheels wet. Consider a
problem where the robot must be moved from room 1 to
room 4 so that it can accomplish some task in room
4 (e.g., moving a box), but now room 2 is wet.
Unfortunately, after taking the first path found to room 4
(through rooms 2 and 3) the problem solver may
discover that the robot cannot accomplish its task
because it lacks traction due to its wheels being wet.
The problem solver must then backtrack and consider
alternative paths to room 4, until it finds one that
leads through dry rooms, so that a complete solution
can be generated. Thus, an efficient macro problem
solver should stop matching as soon as an appropriate
instantiation is found, but be capable of resuming the
matching process if that instantiation does not lead to
a solution.

Figure 5: Room connections, and a macro for moving between rooms
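A matcher that stops at the first instantiation but can be resumed maps naturally onto a lazy generator. The sketch below is an illustration only: the room graph is an assumed one (the connections of figure 5 are not reproduced here), and the later failure caused by wet wheels is approximated by a simple check on the rooms passed through.

```python
# Assumed room connections; not the graph of figure 5.
EDGES = [(1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 4), (3, 7)]
NEIGHBORS = {}
for a, b in EDGES:
    NEIGHBORS.setdefault(a, []).append(b)
    NEIGHBORS.setdefault(b, []).append(a)


def match_four_room_macro(start, goal):
    """Lazily yield bindings (r1, r2, r3, r4) of four connected rooms.

    Because this is a generator, matching pauses after each
    instantiation and resumes only if the caller asks for another.
    """
    for r2 in sorted(NEIGHBORS[start]):
        for r3 in sorted(NEIGHBORS[r2]):
            if goal in NEIGHBORS[r3]:
                yield (start, r2, r3, goal)


def first_usable_path(start, goal, wet_rooms):
    """Take instantiations one at a time until one avoids wet rooms."""
    for path in match_four_room_macro(start, goal):
        if not wet_rooms.intersection(path[1:]):
            return path   # remaining matches are never computed
    return None


all_matches = list(match_four_room_macro(1, 4))  # what an all-matches matcher does
print("all instantiations:", all_matches)
print("first dry path:    ", first_usable_path(1, 4, wet_rooms={2}))
```

The design choice here is that the problem solver, not the matcher, decides when more instantiations are needed, which keeps the search depth-first in flavor.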

Finally,  we  note  that  this  issue  is  even  more  prob¬ 
lematic  for  problem  solvers,  such  as  MORRIS,  that 
use  heuristic  evaluation  to  determine  which  operator 
or  macro  to  select.  Such  problem  solvers  normally  at¬ 
tempt  to  find  all  instantiations  of  relevant  operators 
at  each  decision  point  so  that  they  can  be  evaluated. 
In  other  words,  the  system  must  find  all  instantiations 
of  all  operators  and  macros,  and  then  choose  the  best 
alternative.  This  exacerbates  the  matching  problem, 
since  one  cannot  use  the  method  described  above  in 
which  matching  terminates  after  a  single  instantiation 
is  found.  In  such  systems,  macro  learning  can  seri¬ 
ously  distort  the  depth-first  flavor  of  the  search,  since 
the  matching  process  cannot  proceed  in  a  depth-first 
manner. 

5  Decreasing  Path  Cost 

Once a macro has been learned, one can “compile”
the macro by reformulating, rearranging or indexing
its preconditions and/or postconditions in order to
decrease the cost of using the macro. Compilation can
reduce  the  cost  of  testing  and  applying  a  macro  sig¬ 
nificantly  when  compared  to  the  cost  of  successively 
testing  and  applying  the  corresponding  sequence  of 
primitive  operators. 

Much of the work in this area has concentrated
on reducing the cost of matching the preconditions
of macros, because precondition matching is typically
extremely expensive. In fact, precondition matching
is NP-complete, assuming the preconditions are each
existentially quantified conjuncts [7; 16]. (Precondition
matching is even more expensive if the precondition
language allows arbitrary existential and universal
quantification; in this case the task is PSPACE-complete.)
Many standard matching algorithms, including
the matching algorithm used in PRODIGY [16] and the
RETE matching algorithm [6] used by SOAR [10], run
in time $O(s^p)$ in the worst case, where s is the number
of conditions that are true in the state, and p is the
number of preconditions in the macro.

The cost of matching is particularly important for
macro systems, because macros can have many
preconditions. If a macro is created from a primitive
operator sequence $O_1, O_2, \ldots, O_n$, then every precondition
of every primitive operator in this sequence will also
be a precondition of the resulting macro, except for
preconditions that are guaranteed to be true due to
the action of earlier operators in the sequence. (A
precondition of $O_k$ is guaranteed to be true if an
earlier operator $O_j$, $j < k$, either tests or adds that same
precondition, and no operator between $O_j$ and $O_k$ can
delete that condition.) Thus the number of preconditions
in a macro will normally be less than the sum
of the preconditions of the primitive operators from
which it is composed, but the difference is often
relatively small. In general, a macro may have up to
$(P - 1) \times N$ preconditions, where N is the length of
the operator sequence, and P bounds the number of
preconditions in an operator. This assumes that each
operator in the original sequence, except for the last,
makes at least one precondition of a subsequent
operator true.® Therefore methods for reducing the
matching cost for macros can be quite important. A great
deal of attention was focused on this in the PRODIGY
system, where the process of reducing match cost was
referred to as “compression”.

®This formula also assumes that there is a single goal
that the macro is designed to make true, and that the
macro’s preconditions are formed solely from the
preconditions of the primitive operators in the sequence. (In some
systems, extra preconditions must be inserted in the macro
in order to preserve correctness. For example, in STRIPS-like
systems, a macro may contain additional preconditions
such as (NOT-EQUAL a b), which specifies that variables
a and b cannot have the same value, in order to guarantee
that the postconditions of the macro operate properly.)
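The rule for which preconditions survive in a macro can be sketched as a single pass over the operator sequence. The example assumes ground (variable-free) conditions written as strings and simplified versions of GO-TO-OBJ and PICK-UP-OBJ; the add and delete lists shown are illustrative guesses, not the operator definitions of [16].

```python
from typing import NamedTuple


class Operator(NamedTuple):
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset


def macro_preconditions(ops):
    """Collect the preconditions a macro must test.

    A precondition is skipped when an earlier operator tested or added
    it and no operator in between deleted it.
    """
    needed = []          # preconditions of the macro, in order
    guaranteed = set()   # conditions known to hold at this point
    for op in ops:
        for cond in sorted(op.pre):
            if cond not in guaranteed:
                needed.append(cond)
                guaranteed.add(cond)   # tested, so it holds from here on
        guaranteed -= op.delete
        guaranteed |= op.add
    return needed


# Simplified, assumed operator definitions.
GO_TO_OBJ = Operator(
    "GO-TO-OBJ",
    frozenset({"(IS-OBJECT x)", "(IN-ROOM x rx)", "(IN-ROOM ROBOT rx)"}),
    frozenset({"(NEXT-TO ROBOT x)"}),
    frozenset(),
)
PICK_UP_OBJ = Operator(
    "PICK-UP-OBJ",
    frozenset({"(NEXT-TO ROBOT x)", "(ARMEMPTY)", "(CARRIABLE x)"}),
    frozenset({"(HOLDING x)"}),
    frozenset({"(ARMEMPTY)"}),
)

for cond in macro_preconditions([GO_TO_OBJ, PICK_UP_OBJ]):
    print(cond)
```

In this sketch (NEXT-TO ROBOT x) drops out of the macro because GO-TO-OBJ adds it, while the remaining preconditions of both operators are kept.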

The  most  obvious  technique  for  reducing  the  cost 
of testing a list of preconditions is to order the
preconditions in an intelligent manner. As discussed in
[16],  if  the  preconditions  of  a  macro  are  simply  or¬ 
dered  so  that  they  will  be  tested  in  the  same  order  as 
in  the  corresponding  conditions  in  the  primitive  rule 
sequence,  then  the  cost  of  testing  the  preconditions 
of  the  macro  will  approximate  the  cost  of  searching 
in  the  original  search  space.^  Many  macro-learning 
systems,  including  MORRIS,  PRODIGY  and  SOAR, 
employ  heuristic  techniques  to  generate  a  better  or¬ 
der.  Optimal  precondition  ordering  is  generally  not 
guaranteed, as it requires knowledge about the domain
which is typically unavailable; for example, it
is  necessary  to  know  the  probability  that  a  condi¬ 
tion  will  match,  given  that  other  conditions  have,  or 
have  not,  successfully  matched.  Although  precondi¬ 
tion  ordering  techniques  are  commonly  used  to  im¬ 
prove  matching  cost,  the  problem  has  received  very 
little  attention  in  the  AI  literature  as  compared  to 
the  well-studied  problem  of  finding  good  subgoal  or¬ 
derings  for  problem  solving  and  inference  (e.g.,  [4; 
23]).  However,  the  problem  of  optimizing  precondition 
ordering  for  matching  has  received  significant  atten¬ 
tion  in  the  database  literature,  where  it  is  considered 
a  part  of  conjunctive  query  optimization[25]. 

In  addition  to  ordering  the  preconditions  of  a  macro, 
it  is  often  possible  to  simplify  the  preconditions, 
thereby reducing their match cost. For example, Table
1 shows the preconditions for a macro composed of
operators GO-TO-OBJ, PICKUP-OBJ, GO-TO-DOOR,
PUTDOWN-NEXT-TO, which is learned from a problem
where the robot must position an object next to
the door. (The operator definitions are taken from
[16].) Next to each precondition is shown the operator
from which it came. These preconditions can be
simplified considerably, as shown in the table. Notice
that the precondition (IS-OBJECT x), for instance,
is unnecessary and can be removed, because
(IS-CARRIABLE x) is also a precondition, and anything
that is carriable is an object. Also notice that
(IN-ROOM ROBOT ry) is unnecessary given that
(IN-ROOM ROBOT rx) is true, since the robot can
only be in one room at a time.^ As even this simple
example  illustrates,  simplification  and  reordering  are 
most  useful  when  the  operators  in  a  macro  constrain 
each  other.  In  our  example,  the  fact  that  the  object 


^We note that the cost of matching a list of preconditions
depends not only on the number of preconditions, but
also  on  the  number  of  bindings  generated  for  the  variables 
in  the  preconditions.  However,  as  described  in  [16],  the 
number  of  bindings  generated  when  matching  a  macro’s 
preconditions  can  be  expected  to  be  the  same  as  the  num¬ 
ber  of  bindings  generated  when  matching  the  correspond¬ 
ing  operator  sequence,  assuming  that  the  conditions  are 
tested  in  the  same  order. 

^If  the  robot  could  be  in  more  than  one  room  at  a  time, 
these  preconditions  would  not  be  redundant,  since  rz  and 
ry  have  different  restrictions. 


BEFORE:

(IS-OBJECT x)          from GO-TO-OBJ
(IN-ROOM x rx)         from GO-TO-OBJ
(IN-ROOM ROBOT rx)     from GO-TO-OBJ
(ARMEMPTY)             from PICKUP-OBJ
(CARRIABLE x)          from PICKUP-OBJ
(IN-ROOM ROBOT ry)     from GO-TO-DOOR
(IS-DOOR dr)           from GO-TO-DOOR
(DR-TO-RM dr ry)       from GO-TO-DOOR
(IN-ROOM ROBOT rz)     from PUTDN-NXT-TO
(IS-OBJECT dr)         from PUTDN-NXT-TO
(IN-ROOM ry dr)        from PUTDN-NXT-TO

AFTER:

(IN-ROOM ROBOT rz)
(IN-ROOM x rz)
(CARRIABLE x)
(ARMEMPTY)
(DR-TO-RM dr rx)

Table 1: The preconditions of a macro, before and
after simplification and reordering


must  be  carriable  and  in  the  same  room  as  the  robot 
may  greatly  constrain  the  choice  of  object.  Thus  ap¬ 
plying  this  macro  may  be  significantly  more  efficient 
than  problem  solving.  For  instance,  consider  a  forward 
search  problem  solver.  When  applying  GO-TO-OBJ 
the  system  must  select  an  object  to  go  to,  and  an  arbi- 
trary  object  in  the  room  will  be  chosen;  it  is  only  when 
the  system  attempts  to  apply  PICK-UP-OBJ  that  the 
CARRIABLE  constraint  is  encountered,  in  which  case 
backtracking  will  be  necessary  if  the  object  cannot  be 
carried.  In  contrast,  after  learning  the  macro,  the  pre¬ 
conditions  can  be  ordered  so  that  the  choice  of  an  ob¬ 
ject  is  immediately  determined  by  the  CARRIABLE 
constraint.  (If  the  preconditions  are  not  ordered  ap¬ 
propriately  then  inefficiencies  will  result.  For  instance, 
if  the  preconditions  are  ordered  in  the  same  order  that 
the  problem  solver  encounters  them,  then  the  matcher 
can  recreate  the  same  binding  mistake  as  the  problem 
solver!  This  subject  is  discussed  at  length  in  [16].) 
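The effect of precondition order on match cost can be illustrated with a naive conjunctive matcher that counts how many condition-against-fact tests it performs. This is a minimal sketch under assumptions: variables are strings beginning with "?", the state holds five objects of which one is carriable, and the matcher is a simple left-to-right backtracking search rather than the matcher of any system discussed here.

```python
def unify(pattern, fact, bindings):
    """Extend bindings so that pattern equals fact, or return None."""
    if len(pattern) != len(fact):
        return None
    new = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            if new.setdefault(p, f) != f:   # clashes with earlier binding
                return None
        elif p != f:
            return None
    return new


def match(preconds, facts, bindings=None, stats=None):
    """Yield every binding satisfying all preconds, tested left to right."""
    bindings = bindings or {}
    if not preconds:
        yield bindings
        return
    first, rest = preconds[0], preconds[1:]
    for fact in facts:
        if stats is not None:
            stats["tests"] += 1
        extended = unify(first, fact, bindings)
        if extended is not None:
            yield from match(rest, facts, extended, stats)


# Assumed state: five objects in rm1 with the robot, one of them carriable.
FACTS = [("IN-ROOM", f"obj{i}", "rm1") for i in range(5)]
FACTS += [("IN-ROOM", "ROBOT", "rm1"), ("CARRIABLE", "obj4")]

PROBLEM_SOLVER_ORDER = [("IN-ROOM", "?x", "?r"),
                        ("IN-ROOM", "ROBOT", "?r"),
                        ("CARRIABLE", "?x")]
CARRIABLE_FIRST = [("CARRIABLE", "?x"),
                   ("IN-ROOM", "?x", "?r"),
                   ("IN-ROOM", "ROBOT", "?r")]

for label, order in [("problem-solver order", PROBLEM_SOLVER_ORDER),
                     ("CARRIABLE first", CARRIABLE_FIRST)]:
    stats = {"tests": 0}
    results = list(match(order, FACTS, stats=stats))
    print(label, "->", results, "after", stats["tests"], "tests")
```

Both orderings find the same binding, but testing the CARRIABLE constraint first prunes the candidate objects immediately, whereas the problem-solver order repeats the binding mistake for every object in the room.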

There  are  various  techniques  for  simplifying  precon¬ 
dition  expressions,  depending  on  how  much  knowl¬ 
edge  is  available.  If  arbitrary  inference  rules  are  avail¬ 
able  to  the  simplifier,  then  producing  the  least  expen¬ 
sive  precondition  expression  requires  theorem-proving 
[16]. Unfortunately, theorem-proving is undecidable.
Less expensive simplification techniques include partial
evaluation, and the application of heuristic
transformations. All three of these techniques were applied
by PRODIGY’s compression module, although
the theorem-prover was extremely restricted.

Other  techniques  for  reformulating  preconditions 
are also available. For example, Chase et al. [3] have
described  a  technique  that  takes  a  precondition  ex¬ 
pression  and  drops  conjuncts  that  are  expensive  to 
evaluate.  The  purpose  is  to  find  an  approximation  to 
the  original  expression  that  is  less  expensive  to  match. 
Current results indicate that their technique leads to
only  minor  improvements  in  efficiency,  but  the  ap¬ 
proach  appears  promising.  Techniques  for  indexing 
macros,  which  effectively  reduce  the  average  time  to 
check  whether  a  macro  is  applicable,  are  also  an  area 
of current interest (e.g., [1]). Indexing combines aspects
of precondition reordering (in that the most important
conditions are examined first) and structure
sharing  (in  that  many  macros  can  share  indices)  to 
improve  efficiency. 

6  Eliminating  Redundancy 

A  significant  problem  with  using  macro-operators  is 
that  they  introduce  redundancy.  As  described  ear¬ 
lier,  macros  introduce  redundancy  in  two  ways.  First, 
states  may  be  searched  repeatedly.  Secondly,  macros 
which  represent  common  subpaths  may  have  duplicate 
substructure. 

One  way  to  reduce  the  first  type  of  redundancy  is 
simply  to  record  each  search  state  so  that  it  is  not  vis¬ 
ited  more  than  once.  This  approach  was  investigated 
by Markovitch and Scott [14], who pointed out some
obvious  problems.  First,  the  bookkeeping  costs  are 
high,  especially  the  space  cost.  Secondly,  the  problem 
is  not  completely  eliminated,  since  the  problem  solver 
would  still  visit  some  states  more  than  once  (although 
it  would  recognize  that  such  a  state  had  already  been 
visited,  and  discontinue  search  along  that  path).  Nev¬ 
ertheless,  Markovitch  and  Scott  did  implement  this  so¬ 
lution,  and  found  that  performance  of  their  system  was 
improved  by  a  factor  of  two. 
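
The state-recording idea can be illustrated with a minimal sketch. It is not Markovitch and Scott's implementation; the toy domain (integer states, operators +1 and +2, and a macro +3) and all names are assumptions made for illustration. The sketch shows both drawbacks noted above: the visited set grows with the number of states, and duplicate states are still generated before being pruned.

```python
from collections import deque

def search(start, goal, successors):
    """Breadth-first search that records every state it has generated."""
    visited = {start}           # bookkeeping: one entry per distinct state
    frontier = deque([start])
    revisits = 0                # times a known state was generated again
    while frontier:
        state = frontier.popleft()
        if state == goal:
            return True, revisits
        for nxt in successors(state):
            if nxt in visited:
                revisits += 1   # recognized as visited; search stops on this path
                continue
            visited.add(nxt)
            frontier.append(nxt)
    return False, revisits

# Hypothetical toy domain: primitive operators +1 and +2, plus a macro +3.
def successors(n):
    return [n + 1, n + 2, n + 3] if n < 6 else []

found, revisits = search(0, 6, successors)
```

Here `revisits` counts the redundant arrivals that the macro and the primitive operators jointly produce, which the bookkeeping detects but does not prevent.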

A method for reducing the second type of redundancy
is to employ a scheme for sharing common substructure
within macros. For example, if two macros share several
preconditions, we can separate them into three structures:
one intermediate structure with the shared preconditions,
and two others, each with the remaining conditions. This
way the shared preconditions in the intermediate structure
need only be matched once. Wogulis and Langley [26] have
suggested such a scheme for creating intermediate concepts
from an initial domain theory. Such techniques would
presumably be even more useful once macros have been
learned, due to the increase in redundancy.
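
As a rough sketch of the three-structure split described above, preconditions can be treated as sets of condition names. This is an assumption for illustration only, not the scheme of [26]; real preconditions contain variables and require unification rather than set membership.

```python
def factor_shared(pre_a, pre_b):
    """Split two precondition sets into (shared, rest of a, rest of b)."""
    shared = pre_a & pre_b
    return shared, pre_a - shared, pre_b - shared

def match_both(shared, rest_a, rest_b, state):
    """Test the shared conditions once, then only the remainders."""
    if not shared <= state:
        return False, False      # neither macro applies; remainders are skipped
    return rest_a <= state, rest_b <= state

# Hypothetical macros with preconditions A, B, C, D and A, B, C, E.
macro_a = frozenset({"A", "B", "C", "D"})
macro_b = frozenset({"A", "B", "C", "E"})
shared, rest_a, rest_b = factor_shared(macro_a, macro_b)
```

When the shared conditions fail, both macros are rejected after a single test of the intermediate structure.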

A limited form of substructure sharing is automatically
implemented within the RETE match network [6], upon which
the SOAR system is built. If SOAR contains a chunk with
preconditions A, B, C, D, E and another chunk with
preconditions A, B, C, F, G, H, then the network will
automatically create a shared structure for matching
conditions A, B, and C. Unfortunately, if the second chunk
is H, A, B, C, D, the network will not create shared
substructure, since it will only do so if it can find an
identical initial subsequence.
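
The prefix-only nature of this sharing can be illustrated with a simple trie over condition sequences. This is a sketch of the sharing behavior only, not of the RETE algorithm itself; the function name and the node-counting measure are assumptions for illustration.

```python
def build_network(chunks):
    """Insert each condition sequence into a trie; count the nodes created."""
    root = {}
    node_count = 0
    for conditions in chunks:
        node = root
        for cond in conditions:
            if cond not in node:     # a new node is needed only off the shared prefix
                node[cond] = {}
                node_count += 1
            node = node[cond]
    return root, node_count

# Chunks from the text: an identical initial subsequence A, B, C is shared.
_, with_sharing = build_network([list("ABCDE"), list("ABCFGH")])

# Same conditions A, B, C, D, but led by H: no common prefix, so nothing is shared.
_, without_sharing = build_network([list("ABCDE"), list("HABCD")])
```

In the first case only eight nodes are built for eleven conditions; in the second, all ten conditions get their own nodes.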

More complex methods for sharing substructure are
of course possible, but little research has been done.
Methods for maximizing shared substructure may be
quite complex. Moreover, such methods generally conflict
with compilation schemes for reducing path cost (such as
those discussed in the previous section). For example,
maximizing substructure sharing may conflict with optimally
ordering the conditions in a macro. Thus, if we have two
macros with preconditions A, B, C, H, and preconditions
A, B, C, J, it may be that, individually, the best ordering
of their preconditions is A, H, C, B and A, B, J, C,
respectively, which reduces the amount of sharing that is
possible. Similarly, methods for simplifying the
preconditions of macros may destroy possibilities for sharing.

Interestingly,  sharing  of  preconditions  via  the  cre¬ 
ation  of  intermediate  concepts  tends  to  recreate  the 
structure  of  the  original  search  space.  This  can  be  seen 
in  figure  2.  Whereas  complete  macro  formation  will 
convert  the  space  shown  on  the  top  of  the  figure  into 
the space shown on the bottom of the figure, the creation
of  intermediate  concepts  tends  to  work  in  exactly  the 
opposite  direction,  i.e.,  it  will  convert  the  space  shown 
on  the  bottom  of  the  figure  into  the  space  shown  on  the 
top  of  the  figure.  For  this  reason,  Wogulis  and  Lan¬ 
gley  describe  their  method  for  creating  intermediate 
concepts  as  the  inverse  of  explanation-based  learning. 
However,  when  both  techniques  are  used  in  conjunc¬ 
tion,  in  a  discriminating  way,  the  advantages  of  both 
can be combined (as was found with the CBG game-
player [15], where structure sharing was implemented
by hand).

The  most  intensively  investigated  technique  for  re¬ 
ducing  redundancy  is  to  limit  the  number  of  macros 
that  are  used  by  the  system  [17;  16;  13;  27;  8].  (As 
this  subject  has  received  considerable  attention  else¬ 
where,  we  will  only  briefly  touch  upon  it.)  There 
are  a  variety  of  methods  for  selecting  the  most  useful 
macros.  Frequency  of  use,  heuristic  informativeness, 
and a variety of other utility metrics have all been
investigated. Markovitch and Scott [12] introduce a taxonomy
of methods for limiting macro use. Depending on when the
“filtering” process takes place, they distinguish between
selective experience, selective attention, selective
acquisition, selective retention, and
selective  utilization.  These  methods  have  a  common 
purpose: by restricting a system’s consideration to
the  most  useful  macros,  not  only  is  redundancy  lim¬ 
ited,  but  the  ordering  bias  is  improved  so  that  the  most 
useful  paths  in  the  search  space  are  explored  first. 

7  Conclusion 

This  paper  has  introduced  a  model  that  suggests  that 
macro learning has three primary effects: the search
space  is  reordered,  the  path  cost  of  reaching  particu¬ 
lar  states  may  change,  and  redundancy  may  increase. 
The  model  represents  a  first  step  towards  a  predictive 
theory  of  the  utility  of  macro  learning.  Previous  work 
has  indicated  that  the  utility  of  macro  learning  is  high 
when  a  small  set  of  macros  can  be  learned  that  are  suf¬ 
ficient  for  solving  most  problems.  In  this  case,  accord¬ 
ing  to  our  model,  the  positive  effect  of  reordering  is 
maximized,  while  the  negative  effect  of  redundancy  is 
minimized.  In  this  paper,  we  have  examined  a  variety 
of  more  detailed  issues  in  the  design  of  macro  systems, 
and  the  complex  tradeoffs  that  determine  the  utility 
of macro learning. Eventually, we hope that through
a  combination  of  additional  empirical  studies,  and  re¬ 
finements  to  the  model,  we  will  be  able  to  validate  the 
model  and  make  quantitative  predictions  concerning 
the  performance  of  macro-learning  techniques. 
Abstract

This  paper  describes  how  a  reasoner  can  im¬ 
prove  its  understanding  of  an  incompletely 
understood  domain  through  the  application 
of  what  it  already  knows  to  novel  problems 
in that domain. Recent work in AI has
dealt  with  the  issue  of  using  past  explana¬ 
tions  stored  in  the  reasoner’s  memory  to  un¬ 
derstand  novel  situations.  However,  this  pro¬ 
cess  assumes  that  past  explanations  are  well 
understood  and  provide  good  “lessons”  to  be 
used  for  future  situations.  This  assumption 
is  usually  false  when  one  is  learning  about 
a  novel  domain,  since  situations  encountered 
previously  in  this  domain  might  not  have 
been  understood  completely.  Instead,  it  is 
reasonable  to  assume  that  the  reasoner  would 
have  gaps  in  its  knowledge  base.  By  rea¬ 
soning  about  a  new  situation,  the  reasoner 
should  be  able  to  fill  in  these  gaps  as  new 
information comes in, reorganize its explanations
in memory, and gradually evolve a better
understanding of its domain.

We  present  a  story  understanding  program 
that  retrieves  past  explanations  from  situa¬ 
tions  already  in  memory,  and  uses  them  to 
build  explanations  to  understand  novel  sto¬ 
ries  about  terrorism.  In  doing  so,  the  system 
refines  its  understanding  of  the  domain  by 
filling  in  gaps  in  these  explanations,  by  elab¬ 
orating  the  explanations,  or  by  learning  new 
indices  for  the  explanations.  This  is  a  type 
of incremental learning since the system improves
its explanatory knowledge of the domain
in an incremental fashion rather than
by  learning  new  XPs  as  a  whole. 

1  Case-based  learning 

Case-based reasoning and learning programs deal with
the issue of using past experiences or cases to understand,
plan for, or learn from novel situations [Kolodner, 1988;
Hammond, 1989]. This happens according to the following
process: (a) Use the problem description to get reminded
of an old case. (b) Retrieve the results (lessons,
explanations, plans) of processing the old case and give
them to the understander, planner or problem-solver.
(c) Adapt the results from the old case to the specifics
of the new situation. (d) Apply the adapted results to
the new situation.
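
The four-step process can be sketched as follows. This is a deliberately simplified illustration, not a description of any particular case-based system: cases are assumed to be flat feature sets, reminding is assumed to be feature overlap, and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Case:
    features: frozenset   # description of the old situation
    lesson: str           # result of processing it (lesson, explanation, or plan)

def retrieve(memory, problem):
    # (a) and (b): get reminded of the stored case sharing the most features,
    # and hand back the results attached to it.
    return max(memory, key=lambda case: len(case.features & problem))

def adapt(case, problem):
    # (c): note which parts of the new situation the old case does not cover.
    unexplained = sorted(problem - case.features)
    return {"lesson": case.lesson, "unexplained": unexplained}

def solve(memory, problem):
    old_case = retrieve(memory, problem)
    return adapt(old_case, problem)   # (d): the caller applies the adapted result

memory = [
    Case(frozenset({"businessman", "large-withdrawals"}), "blackmail"),
    Case(frozenset({"bombing", "suicide", "fanatic"}), "religious-fanatic"),
]
result = solve(memory, frozenset({"bombing", "suicide", "teenager"}))
```

The quality of `result` depends entirely on the stored cases, which is the assumption questioned in the next paragraph.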

The  intent  behind  case-based  reasoning  is  to  avoid 
the  effort  involved  in  re-deriving  these  lessons,  expla¬ 
nations  or  plans  by  simply  reusing  the  results  from  pre¬ 
vious  cases.  However,  this  process  assumes  that  past 
cases  are  well  understood  and  provide  good  “lessons” 
to  be  used  for  future  situations,  since  it  is  these  very 
cases  that  determine  the  performance  of  the  system  in 
new  situations.  This  assumption  is  usually  false  when 
one  is  learning  about  a  novel  domain,  since  cases  en¬ 
countered  previously  in  this  domain  might  not  have 
been understood completely. Instead, it would be reasonable
to assume that the reasoner would have gaps
in the knowledge represented by these cases.

Even  if  past  cases  are  not  well  understood,  they 
can  still  be  used  to  guide  processing  in  new  situations. 
However, in addition to using the past case to understand
the new situation, a reasoner can also learn more
about  the  old  case  itself,  and  thus  improve  its  under¬ 
standing  of  the  domain.  This  is  an  important  problem 
that  has  not  been  addressed  in  case-based  reasoning 
research,  and  one  that  is  suited  to  a  machine  learn¬ 
ing  approach  in  which  learning  occurs  incrementally 
as  these  gaps  are  filled  in  through  experience. 

This  paper  describes  a  case-based  story  understand¬ 
ing  system  that  retrieves  past  explanations  from  situa¬ 
tions  already  in  memory,  and  uses  them  to  build  expla¬ 
nations  to  understand  novel  situations  encountered  in 
newspaper  stories  about  terrorism.  The  system  learns 
in an incremental manner, by filling in the gaps in the
retrieved  explanation  that  is  being  used  as  a  precedent 
in  understanding  the  new  situation.  What  is  done  with 
the  newly  learnt  information  depends  on  the  kind  of 
“knowledge  gap”  the  system  is  trying  to  fill.  The  new 
piece  of  knowledge  could  result  in  a  new  explanation  in 
memory;  it  could  be  used  to  fill  in  a  gap  in  an  existing 
explanation;  it  could  be  used  to  elaborate  an  exist¬ 
ing  explanation  if  that  explanation  was  not  detailed 
enough  to  deal  with  the  new  situation;  or  it  could  be 
used  to  reorganize  or  re-index  knowledge  in  memory 
to  allow  the  reasoner  to  use  what  it  already  knows  in 
novel  situations  to  which  that  piece  of  knowledge  had 
not been applied before. Each type of learning leaves
the  system  a  little  closer  to  a  complete  understanding 
of  its  domain.  Each  type  of  learning  could  also  result 
in  a  new  set  of  gaps  as  the  system  realizes  what  else  it 
needs  to  learn  about,  which  in  turn  drives  the  system 
towards  further  learning. 

Much  of  real-world  learning  is  an  incremental  pro¬ 
cess  of  this  type.  A  reasoner  learns  by  modifying  what 
it  already  knows  using  little  pieces  of  new  information 
that  it  comes  across  during  its  experiences.  This  pa¬ 
per  presents  a  theory  of  incremental  learning  for  case- 
based  story  understanding. 

2  Explanation  patterns 

Before  we  can  discuss  the  learning  process,  we  must 
describe  what  needs  to  be  learned.  This  depends  on 
the  purpose  to  which  the  learned  knowledge  will  be 
put.  Consider  the  problem  of  building  motivational 
explanations  for  the  purpose  of  understanding  stories. 
An  understander  could  construct  such  explanations  by 
using  rules  connecting  typical  goals  and  plans  of  peo¬ 
ple  (e.g.,  [Wilensky,  1978]).  However,  this  would  be 
very  inefficient  in  complicated  situations,  where  moti¬ 
vational  causal  chains  could  be  several  steps  long.  To 
get  around  this  problem,  a  case-based  understander 
uses  pre-stored  explanations  for  stereotypical  situa¬ 
tions.  These  explanations  represent  standard  patterns 
that  are  observed  in  these  situations,  and  hence  are 
called  explanation  patterns  [Schank,  1986].  When  the 
understander  sees  a  situation  for  which  it  has  a  canned 
explanation  pattern  (XP),  it  tries  to  apply  the  XP  to 
avoid  detailed  analysis  of  the  situation  from  scratch. 
Thus  an  XP  is  like  an  abstract  case;  it  represents  a 
generalization  based  on  the  understander’s  experiences 
that  can  be  used  as  a  paradigmatic  case  for  similar  sit¬ 
uations  in  the  future. 

For example, a “blackmail” situation may be represented
by the following XP (xp-blackmail):¹

(1) The blackmailee has a goal G1.

(2) The blackmailer has a goal G2, and the blackmailee
does not have the goal G2 (since otherwise
he or she would satisfy the goal without needing
to be threatened).

(3) The blackmailee has a goal G3, which he or she
values above goal G1.

(4) The blackmailer threatens to violate G3 unless the
blackmailee performs an action A that satisfies
G2, even though the action would have a negative
effect of violating G1.
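
To make the structure of this XP concrete, conditions (1) to (3) can be written as a check over two agents' goals. This is only a sketch under assumed representations (goals as strings, goal orderings as pairs); it is not AQUA's XP representation, and condition (4), the threat itself, is an event in the story and is not modelled here.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    goals: set = field(default_factory=set)
    prefers: set = field(default_factory=set)   # (higher, lower) goal pairs

def blackmail_preconditions_hold(blackmailee, blackmailer, g1, g2, g3):
    return (
        g1 in blackmailee.goals                  # (1) blackmailee has G1
        and g2 in blackmailer.goals              # (2) blackmailer has G2 ...
        and g2 not in blackmailee.goals          #     ... and blackmailee does not
        and g3 in blackmailee.goals              # (3) blackmailee has G3 ...
        and (g3, g1) in blackmailee.prefers      #     ... valued above G1
    )

# Hypothetical goal names for illustration.
boy = Agent(goals={"preserve-life", "avoid-something"},
            prefers={("avoid-something", "preserve-life")})
recruiter = Agent(goals={"destroy-headquarters"})

ok = blackmail_preconditions_hold(
    boy, recruiter,
    g1="preserve-life", g2="destroy-headquarters", g3="avoid-something",
)
```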

3  Learning  explanation  patterns 

How  are  stereotypical  XPs  formed  in  memory?  The 
work  in  explanation-based  learning  focusses  on  the 
problem  of  learning  through  the  generalization  of 
causal  structures  underlying  novel  situations  [DeJong 
and Mooney, 1986; Mitchell et al., 1986]. However,

¹ Details of XP representations may be found in [Ram,
1989].


it  is  difficult  to  determine  the  correct  level  of  gener¬ 
alization.  Furthermore,  many  stories  do  not  provide 
enough  information  to  prove  that  the  explanation  is 
correct.  The  understander  must  often  content  itself 
with  two  or  more  competing  hypotheses,  or  otherwise 
jump  to  a  conclusion.  This  means  that  in  a  real  world 
situation,  an  explanation-based  learning  system  may 
still  need  to  deal  with  the  problem  of  incomplete  or 
incorrect  domain  knowledge. 

Thus  the  system’s  memory  of  past  experiences  will 
not  always  contain  “correct”  cases  or  “correct”  expla¬ 
nations,  but  rather  one  or  more  hypotheses  about  what 
the correct explanation might have been.² These hypotheses
often have questions attached to them, representing
what is still not understood or verified about
those  hypotheses.  As  the  understander  reads  new  sto¬ 
ries,  it  is  reminded  of  past  cases,  and  of  old  explana¬ 
tions  that  it  has  tried.  In  attempting  to  apply  these 
explanations  to  the  new  situation,  its  understanding 
of  the  old  case  gradually  gets  refined.  New  indices 
are  learned  as  the  understander  learns  more  about  the 
range  of  applicability  of  the  XP.  The  XP  is  re-indexed 
in  memory  and  is  more  likely  to  be  recalled  only  in 
relevant  situations. 

Thus  XP  learning  is  an  incremental  process  of  the¬ 
ory  formation,  involving  both  case-based  reasoning 
and  explanation-based  learning  processes. 

4  The  AQUA  program 

AQUA  is  a  story  understanding  program  which  learns 
about  terrorism  by  reading  newspaper  stories  about 
terrorist  incidents  in  the  Middle  East  [Ram,  1987; 
Schank  and  Ram,  1988;  Ram,  1989].  AQUA  reads 
stories  about  suicide  bombing  and  attempts  to  under¬ 
stand  them  by  constructing  causal  and  motivational 
explanations  for  the  events  in  the  stories. 

AQUA’s case memory is based on XPs that have
been used to explain past situations. AQUA improves
its  explanatory  knowledge  of  the  domain  through  a 
process  of  re-indexing  and  incremental  modification  of 
these  XPs.  For  example,  suppose  AQUA  has  just  read 
the  following  suicide  bombing  story  (New  York  Times, 
April  14,  1985): 

Boy  Says  Lebanese  Recruited  Him  as 
Car  Bomber. 

JERUSALEM,  April  13  —  A  16-year-old 
Lebanese  was  captured  by  Israeli  troops  hours 
before he was supposed to get into an explosive-
laden car and go on a suicide bombing mission
to  blow  up  the  Israeli  Army  headquarters  in 
Lebanon.  ... 

What  seems  most  striking  about  [Mo¬ 
hammed]  Burro’s  account  is  that  although  he 

² Actually, a single story or episode can provide more
than  one  “case,”  each  case  being  a  particular  interpretation 
or  dealing  with  a  particular  aspect  of  the  story.  For  an 
explanation  program,  each  anomaly  in  a  story,  along  with 
the  corresponding  set  of  explanatory  hypotheses,  can  be 
used  as  a  case. 
is a Shiite Moslem, he comes from a secular family
background. He spent his free time not in
prayer,  he  said,  but  riding  his  motorcycle  and 
playing  pinball.  According  to  his  account,  he 
was  not  a  fanatic  who  wanted  to  kill  himself  in 
the  cause  of  Islam  or  anti-Zionism,  but  was  re¬ 
cruited  for  the  suicide  mission  through  another 
means:  blackmail. 


After  reading  this  story,  AQUA  builds  the  following 
hypothesis  tree  in  memory,  representing  an  anomaly 
(Why would the bomber perform an action that resulted
in  his  own  death?),  alternative  hypotheses  constructed 
by  applying  known  XPs  to  the  anomalous  situation 
(religious  fanatic  and  blackmail),  questions  that  would 
verify these hypotheses, and answers to these questions,
if any:³


           WHY DID THE BOMBER DO THE SUICIDE BOMBING?
                  /                        \
        THE BOMBER WAS A             THE BOMBER WAS
        RELIGIOUS FANATIC            BLACKMAILED INTO
        (refuted).                   THE SUICIDE BOMBING.
          /           \                     |
  WHAT IS THE     WHAT IS THE        WHAT COULD THE
  RELIGION OF     RELIGIOUS ZEAL     BOMBER WANT MORE
  THE BOMBER?     OF THE BOMBER?     THAN HIS OWN LIFE?
       |               |
  SHIITE MOSLEM   NOT A FANATIC


The  final  explanation  built  for  this  story  involves  a 
novel  application  of  a  stereotypical  XP,  xp-blackmail. 
Even though the system already knows about blackmail,
it learns a new variant of this XP
(xp-blackmail-suicide-bombing), based on the particular
manner in which xp-blackmail was adapted to
fit  the  story.  AQUA  also  learns  indices  to  the  new  XP. 
Both  kinds  of  learning  are  important  in  a  case-based 
reasoning  system.  Let  us  start  with  the  latter. 


5  Learning  indices  for  explanation 
patterns 

Regardless  of  whether  a  new  XP  is  learned  from 
scratch  or  by  applying  an  existing  XP  to  a  new  sit¬ 
uation,  the  XP  needs  to  be  indexed  in  memory  ap¬ 
propriately  so  that  it  can  be  used  in  future  situa¬ 
tions  in  which  it  is  likely  to  be  useful.  Ideally,  an  XP 
should  be  indexed  in  memory  such  that  it  is  retrieved 
only  in  those  situations  in  which  it  is  applicable.  But 
this  is  impossible  in  practice.  For  example,  consider 
the  applicability  conditions  for  “blackmail.”  In  gen¬ 
eral,  blackmail  is  a  possibility  whenever  “someone  does 
something  he  doesn’t  want  to  do  because  not  doing  it 
results  in  something  worse  for  him.”  But  trying  to 
show  this  in  general  is  very  hard.  Thus,  in  addition  to 
general  applicability  conditions,  an  understander  must 
learn  specific,  sometimes  superficial,  features  that  sug¬ 
gest  possibly  relevant  XPs  even  though  they  may  not 
completely  determine  the  applicability  of  the  XP  to 

³ The understanding process by which AQUA builds this
hypothesis  tree  is  irrelevant  for  the  purposes  of  this  paper. 
Details  may  be  found  in  [Ram,  1989]. 


the  situation.  For  example,  a  classic  blackmail  situa¬ 
tion  is  one  where  a  rich  businessman  who  is  cheating 
on his wife is blackmailed for money using the threat
of  exposure.  If  one  read  about  a  rich  businessman 
who  suddenly  began  to  withdraw  large  sums  of  money 
from  his  bank  account,  one  would  expect  to  think  of 
the  possibility  of  blackmail.  However,  one  does  not 
normally  think  of  blackmail  when  one  reads  a  story 
about  suicide  bombing,  although  theoretically  it  is  a 
possible  explanation. 

The  point  is  that  XPs  are  associated  with  stereo¬ 
typical  situations  and  people  in  memory.  An  under¬ 
stander  needs  to  learn  the  stereotypical  categories  that 
serve  as  useful  indices  for  motivational  explanations. 
This is a type of inductive category formation [Dietterich
and Michalski, 1981]; however, the generalization
process is constrained so that the features selected for
generalization are those that are causally relevant to
the explanations being indexed [Flann and Dietterich,
1989].

AQUA  indexes  motivational  XPs  in  memory  us¬ 
ing  typical  contexts  in  which  the  XPs  might  be  en¬ 
countered  (situation  indices),  as  well  as  character 
stereotypes  representing  typical  categories  of  people  to 
whom  the  XPs  might  be  applicable  (stereotype  indices) 
[Ram, 1989]. In the above example, AQUA learns a
new  context  for  blackmail  (suicide  bombing),  as  well 
as  a  new  character  stereotype  representing  the  type 
of  person  who  one  might  expect  to  see  involved  in  a 
“blackmailed  into  suicide  bombing”  explanation.  Let 
us  discuss  how  AQUA  learns  these  indices. 

5.1  Learning  situation  indices 

AQUA  learns  new  contexts  (e.g.,  “suicide  bombing”) 
for  stereotypical  XPs  (e.g.,  “blackmail”)  which  are 
then  used  as  situation  indices  for  these  XPs  in  the 
future.  The  main  issue  here  is  how  far  the  context 
should  be  generalized  before  it  is  used  as  an  index.  In 
the  above  example,  should  the  new  situation  index  for 
blackmail  be  suicide-bombing,  suicide,  bombing, 
destroy,  or  indeed  any  MOP  (action)  with  a  neg¬ 
ative  side  effect  for  the  actor?  The  issue  here  isn’t 
one  of  correctness  but  of  utility.  As  discussed  earlier, 
xp-blackmail  is  a  possibility  whenever  the  actor  does 
something  he  would  ordinarily  not  do  because  of  a  neg¬ 
ative  side  effect.  However,  XP  theory  tries  to  replace 
generalized  reasoning  of  this  form  with  specific  reason¬ 
ing  about  stereotypical  situations.  The  latter  is  more 
efficient  even  though  it  is  less  general. 

After  reading  the  above  story,  for  example,  one 
would  expect  to  think  of  blackmail  when  one  reads  an¬ 
other  story  about  a  suicide  bombing  attack.  However, 
one  would  probably  not  think  of  blackmail  on  read¬ 
ing  any  story  about  suicide,  say,  a  teenager  killing 
himself  after  failing  his  high  school  examinations,  even 
though  theoretically  it  is  a  possible  explanation.  Fur¬ 
thermore,  it  would  not  be  useful  to  index  the  new  XP 
under bombing in general (as opposed to suicide-
bombing in particular), since the particular goal violation
of the p-life goal is central to this explanation.

Thus  in  the  above  example,  AQUA  uses  suicide- 
Figure 1: Learning situation indices for XPs. Upward lines represent isa links, and downward dotted lines represent
scenes  of  MOPs.  Heavy  lines  represent  situation  indices,  which  point  from  MOPs  to  XPs.  Here,  AQUA  has  just  built  a 
situation  index  from  suicide-bombing  to  a  copy  of  xp-blackmail. 


bombing  as  the  situation  index  for  the  new  variant  of 
xp-blackmail  (figure  1).  After  reading  several  stories 
about  blackmail,  AQUA  would  know  about  different 
stereotypical  situations  in  which  to  use  the  blackmail 
explanation,  rather  than  a  generalized  logical  descrip¬ 
tion  of  every  situation  in  which  blackmail  is  a  possible 
explanation.  In  other  words,  AQUA  would  have  in¬ 
dexed  a  copy  of  xp-blackmail  under  all  the  MOPs  for 
which  it  has  seen  xp-blackmail  used  as  an  explana¬ 
tion.  Whenever  these  MOPs  are  encountered,  AQUA 
would retrieve the new blackmail XP (if the other indices
are also present).⁴ The reason that a copy of the
original  XP  is  used  is  that  the  XP,  once  copied,  will 
need  to  be  modified  for  that  particular  situation,  as 
discussed  below. 
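
The indexing step described in this section can be sketched as a table from MOPs to copies of XPs. This is an illustrative simplification, not AQUA's memory organization: XPs are assumed to be plain dictionaries, MOPs are assumed to be strings, and the function names are hypothetical.

```python
import copy
from collections import defaultdict

situation_index = defaultdict(list)   # MOP name -> XP variants indexed under it

def learn_situation_index(mop, xp, variant_name):
    """Index a copy of the XP under the MOP so the copy can be modified later."""
    variant = copy.deepcopy(xp)        # the original stereotypical XP is left intact
    variant["name"] = variant_name
    situation_index[mop].append(variant)
    return variant

def retrieve_xps(mop):
    """Return the names of the XPs that come to mind for this MOP."""
    return [xp["name"] for xp in situation_index.get(mop, [])]

xp_blackmail = {"name": "xp-blackmail", "goals": ["G1", "G2", "G3"]}
learn_situation_index("suicide-bombing", xp_blackmail,
                      "xp-blackmail-suicide-bombing")
```

Because the index is placed on `suicide-bombing` rather than on a more general MOP such as `suicide` or `bombing`, stories about those more general situations do not retrieve the variant.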

5.2  Learning  stereotype  indices 

The  main  constraint  on  a  theory  of  stereotype  learning 
is  that  the  kinds  of  stereotypes  learned  must  be  useful 
in  retrieving  explanations.  In  other  words,  they  must 
provide  the  kinds  of  discrimination  that  are  needed  for 
indexing  XPs  in  memory.  Since  volitional  explanations 
are  concerned  with  goals,  goal  orderings,  plans  and 
beliefs  of  characters,  the  learning  algorithm  must  pro¬ 
duce  typical  collections  of  goals,  goal-orderings,  plans 

⁴ AQUA can still understand other blackmail situations
that  it  has  not  learned  about  as  yet,  as  it  did  the  story  in 
the  above  example.  Thus  not  having  a  situation  index  for 
an  XP  does  not  necessarily  mean  that  the  XP  cannot  be 
applied  to  the  situation,  but  rather  that  this  XP  is  not  one 
that  would  ordinarily  come  to  mind  in  that  situation. 


and  beliefs,  along  with  predictive  features  for  these  el¬ 
ements.  Such  a  collection  is  called  a  character  stereo¬ 
type. 

Character  stereotypes  serve  as  motivational  cate¬ 
gories  of  characters  and  are  an  important  index  for 
XPs  in  memory.  In  the  above  example,  AQUA  learns 
a new stereotype (stereotype.79) representing a typical
Lebanese teenager who might be blackmailed into
suicide  bombing,  which  is  used  to  index  the  blackmail 
XP.  The  stereotype  is  built  from  the  novel  blackmail 
explanation  by  generalizing  the  features  of  the  charac¬ 
ter  involved  in  that  explanation: 


Answering question: WHY DID THE BOY DO THE SUICIDE BOMBING?
with: THE BOY WAS BLACKMAILED INTO DOING THE SUICIDE BOMBING.

Novel explanation for A SUICIDE BOMBING!

Building new stereotype STEREOTYPE.79:
  Typical goals:
    P-LIFE (in)
    A-DESTROY (OBJECT) (out)
    AVOIDANCE-GOAL (STATE) (question)
  Typical goal-orderings:
    AVOIDANCE-GOAL (STATE) over P-LIFE (question)
  Typical beliefs:
    RELIGIOUS-ZEAL = NOT A FANATIC (in)
  Typical features:
    AGE = TEENAGE AGE (hypothesized)
    RELIGION = SHIITE MOSLEM (hypothesized)
    GENDER = MALE (hypothesized)
    NATIONALITY = LEBANESE (hypothesized)

Indexing XP-BLACKMAIL-SUICIDE-BOMBING
  Stereotype index = STEREOTYPE.79
  Situation index = SUICIDE-BOMBING


The  label  in  (out)  marks  features  that  are  known  to 
be  true  (false)  of  this  stereotype  [Doyle,  1979].  These 
features  are  definitional  of  the  stereotype.  The  label 
question  marks  features  that  are  in  but  incomplete. 
In  this  case,  (AVOIDANCE-GOAL  (STATE))  refers  to  an 
unknown  goal  that  needs  to  be  filled  in  when  the  in¬ 
formation  comes  in.  This  is  represented  as  a  goal 
with  an  unknown  goal-object.  Finally,  the  label 
hypothesized  marks  features  that  were  true  in  this 
story  but  were  not  causally  relevant  to  the  explana¬ 
tion.  These  features  are  retained  for  the  purposes  of 
recognition  and  learning.  Since  AQUA  does  not  as¬ 
sume  that  its  explanations  are  complete,  there  is  the 
possibility  of  learning  more  about  this  explanation  in 
the  future  that  would  help  to  determine  whether  these 
features  have  explanatory  significance.  This  has  not 
yet  been  implemented  in  AQUA. 
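
The four labels can be illustrated by storing the stereotype as a list of labelled items and querying it by label. The flat tuple layout and the helper name are assumptions made for this sketch; they are not AQUA's internal representation.

```python
# (kind, item, label) triples transcribed from the stereotype shown above.
STEREOTYPE_79 = [
    ("goal", "P-LIFE", "in"),
    ("goal", "A-DESTROY (OBJECT)", "out"),
    ("goal", "AVOIDANCE-GOAL (STATE)", "question"),
    ("goal-ordering", "AVOIDANCE-GOAL (STATE) over P-LIFE", "question"),
    ("belief", "RELIGIOUS-ZEAL: NOT A FANATIC", "in"),
    ("feature", "AGE: TEENAGE", "hypothesized"),
    ("feature", "RELIGION: SHIITE MOSLEM", "hypothesized"),
    ("feature", "GENDER: MALE", "hypothesized"),
    ("feature", "NATIONALITY: LEBANESE", "hypothesized"),
]

def items_labelled(stereotype, *labels):
    """Return the items carrying any of the given labels, in stored order."""
    return [item for _, item, label in stereotype if label in labels]

definitional = items_labelled(STEREOTYPE_79, "in", "out")    # known true or false
open_questions = items_labelled(STEREOTYPE_79, "question")   # gaps still to fill
tentative = items_labelled(STEREOTYPE_79, "hypothesized")    # not yet causally tied
```

The `open_questions` list is what drives further learning: each entry is a gap that a later story may fill.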

The  stereotype  is  used  to  index  the  new  explanation 
in  memory  (figure  2).  After  reading  this  story,  AQUA 
uses  the  new  stereotype  to  retrieve  the  blackmail  ex¬ 
planation  when  it  reads  other  stories  about  Lebanese 
teenagers  going  on  suicide  bombing  missions. 

This  stereotype  is  built  through  generalization  un¬ 
der  causal  constraints  from  the  hypotheses  that  were 
considered,  including  the  ones  that  were  ultimately  re¬ 
futed.  The  causal  constraints  are  derived  both  from 
the  successful  explanation  (blackmail)  as  well  as  from 
unsuccessful  hypotheses,  if  any  (here,  religious  fanati¬ 
cism). 


5.2.1  Learning  from  successful  explanations 

Clearly, much of stereotype.79 comes from the motivational
aspects of the blackmail explanation. AQUA
retains  those  goals,  goal  orderings  and  beliefs  of  the 
character  in  the  story  that  are  causally  implicated  in 
the  blackmail  explanation.  Since  blackmail  relies  on  a 
goal ordering between two goals, one of which is sacrificed
for the other, the stereotype must specify that
the  character  has  a  goal  that  he  or  she  values  above 
p-life.  The  stereotype  also  specifies  that  the  char¬ 
acter  would  normally  not  have  the  goal  of  performing 
terrorist  missions,  since  this  is  part  of  the  blackmail 
explanation.  In  the  above  story,  AQUA  infers  the  fol¬ 
lowing  goals  and  goal-orderings  for  the  actor  (corre¬ 
sponding  to  (1),  (2)  and  (3)  of  xp-blackmail,  page  2): 


Building new stereotype (STEREOTYPE.79):

Inferring GOALS
from XP-BLACKMAIL (successful):
  THE ACTOR WANTED TO PRESERVE HIS OWN LIFE.
  THE ACTOR DID NOT WANT TO PERFORM THE
  TERRORIST MISSION.
  THE ACTOR WANTED TO AVOID SOMETHING.

Inferring GOAL-ORDERINGS
from XP-BLACKMAIL (successful):
  THE GOAL OF THE ACTOR TO AVOID SOMETHING
  WAS MORE IMPORTANT THAN THE GOAL OF
  THE ACTOR TO PRESERVE HIS OWN LIFE.

These goals and goal-orderings are added to the stereotype
being built. At this point, stereotype.79 has
the following features:

Typical goals:
  P-LIFE (in)
  A-DESTROY (OBJECT) (out)
  AVOIDANCE-GOAL (STATE) (question)

Typical goal-orderings:
  AVOIDANCE-GOAL (STATE) over P-LIFE (question)

5.2.2  Learning  from  failed  explanations 

Many  explanation-based  learning  programs  learn  only 
from  positive  examples  (e.g.,  [Mooney  and  DeJong, 
1985; Segre, 1987]). However, it is also possible to
apply  this  technique  to  learn  from  negative  examples 
(e.g.,  [Mostow  and  Bhatnagar,  1987;  Gupta,  1987]). 
AQUA  uses  refuted  hypotheses  to  infer  features  that 
should  not  be  present  in  the  newly  built  stereotype. 
These  are  features  which,  if  present,  would  have  led  to 
the  hypothesis  being  confirmed. 

For  example,  in  the  blackmail  story,  AQUA  knows 
that  the  person  being  blackmailed  is  not  a  religious 
fanatic,  since  the  religious  fanatic  explanation,  which 
depended  on  this  fact,  has  been  refuted.  The  kind  of 
person  likely  to  be  blackmailed  into  suicide  bombing 
is,  therefore,  not  a  religious  fanatic.®  This  feature  is 
recorded  in  the  newly  built  stereotype. 

Building new stereotype (STEREOTYPE.79):

XP-RELIGIOUS-FANATIC  failed  because: 

THE  BOY  DID  NOT  BELIEVE  FANATICALLY  IN  THE 
SHIITE  MOSLEM  RELIGION. 

Inferring  BELIEFS 

from  XP-RELIGIOUS-FANATIC  (failed) : 

THE  ACTOR  DID  NOT  BELIEVE  FANATICALLY  IN 
A  RELIGION. 

This results in a new belief being added to
stereotype.79:

®As  before,  this  is  a  stereotypical  inference  and  not  a 
logically  correct  one.  A  religious  fanatic  could  indeed  be 
blackmailed  into  suicide  bombing;  however,  on  reading  a 
story  about  a  religious  fanatic  going  on  a  suicide  bombing 
mission,  blackmail  would  not  normally  come  to  mind.  This 
means that xp-blackmail-suicide-bombing should not be
indexed  under  religious-fanatic,  at  least  on  the  basis  of 
this  example. 

Figure 2: Learning stereotype indices for XPs. Upward lines represent isa links, and downward dotted lines represent
scenes of MOPs. Heavy lines represent indices to XPs. Here, AQUA has just built a stereotype index from stereotype.79,
representing a lebanese-teenager, to xp-blackmail-suicide-bombing.


Typical  beliefs: 

RELIGIOUS-ZEAL - NOT A FANATIC (in)

The  reason  that  learning  from  the  failed  explanation 
works  in  this  example  is  that  the  blackmail  explana¬ 
tion  specifies  that  the  person  being  blackmailed  would 
normally  not  have  the  goal  to  perform  that  action. 
This  rules  out  other  explanations  which  would  result 
in  this  goal.  Our  theory  does  not  deal  with  the  issue  of 
multiple  successful  explanations;  more  research  needs 
to  be  done  in  this  area. 
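
The two learning steps of section 5.2 can be summarized in a short sketch. It is an illustration under stated assumptions, not AQUA's actual implementation: the `Stereotype` class, its method names and the encoding of features as strings are hypothetical; only the feature names and the in/out/question markings are taken from the traces above.

```python
from dataclasses import dataclass, field

@dataclass
class Stereotype:
    """Hypothetical container for the features of a stereotype."""
    name: str
    goals: dict = field(default_factory=dict)           # goal -> status
    goal_orderings: dict = field(default_factory=dict)  # (higher, lower) -> status
    beliefs: dict = field(default_factory=dict)         # attribute -> (value, status)

    def learn_from_success(self, goals, orderings):
        # Features causally implicated in a confirmed explanation are retained.
        self.goals.update(goals)
        self.goal_orderings.update(orderings)

    def learn_from_failure(self, refuted_premise):
        # A premise whose refutation sank a hypothesis is recorded as
        # typically false for this kind of person.
        attribute, value = refuted_premise
        self.beliefs[attribute] = ("NOT " + value, "in")

s = Stereotype("STEREOTYPE.79")
s.learn_from_success(
    goals={"P-LIFE": "in", "A-DESTROY (OBJECT)": "out",
           "AVOIDANCE-GOAL (STATE)": "question"},
    orderings={("AVOIDANCE-GOAL (STATE)", "P-LIFE"): "question"},
)
s.learn_from_failure(("RELIGIOUS-ZEAL", "A FANATIC"))
print(s.beliefs)
```

The "question" status marks features that are hypothesized but not yet explained, which is what later allows them to be raised as questions about the XP.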

6  Modifying  existing  explanation 
patterns 

6.1  Associating  new  questions  with  XPs 

Suppose  AQUA  reads  the  blackmail  story  with  only 
the  religious  fanatic  XP  for  suicide  bombing  in  mem¬ 
ory.  When  reading  this  story,  AQUA  is  handed  an 
explanation  for  the  suicide  bombing:  the  story  explic¬ 
itly  mentions  that  the  bomber  was  blackmailed.  In  a 
sense,  then,  the  story  has  been  understood  since  an  ex¬ 
planation  for  the  bombing  has  been  found.  However, 
one could not really say that AQUA had understood
the story if it didn't ask the question, What could the
boy want more than his own life? Unless this question
is raised while reading the story, one would have to say
that AQUA had missed the point of the story.

Such  questions  correspond  to  gaps  in  the  explana¬ 
tion  structures  that  are  built  during  the  understand¬ 
ing  process  (figure  3).  These  questions  are  associated 
with  the  XP,  and  may  be  answered  later  in  the  story 
or  when  the  XP  is  applied  to  a  future  story.  When 
they  are  answered,  the  understander  can  elaborate  and 
modify  the  XP,  thus  achieving  a  better  understanding 
of  the  causality  represented  by  the  XP. 


6.2  Incremental  refinement  of  XPs  by 
answering  questions 

In addition to raising new questions, of course, an
understander must answer the questions that it already
has in order to improve its knowledge of the domain.
AQUA  uses  its  questions  to  focus  the  understanding 
process,  and  learns  when  these  questions  are  answered. 

For  example,  consider  the  following  story: 

JERUSALEM  —  A  young  girl  drove  an 
explosive-laden  car  into  a  group  of  Israeli  guards 
in  Lebanon.  The  suicide  attack  killed  three 
guards  and  wounded  two  others.  ... 

The  driver  was  identified  as  a  16-year-old 
Lebanese  girl.  ...  Before  the  attack,  she  said  that 
a  terrorist  organization  had  threatened  to  harm 
her  family  unless  she  carried  out  the  bombing 
mission  for  them.  She  said  that  she  was  prepared 
to  die  in  order  to  protect  her  family. 

When this story is read, AQUA retrieves the new
xp-blackmail-suicide-bombing and applies it to the
story. The question that is pending along with this
explanation is also instantiated. When the question
is answered, it is replaced by a new node representing
the protect-family goal, which becomes part of
xp-blackmail-suicide-bombing. Since no explanations
are known for the newly added node, this in turn
becomes a new question about the elaborated XP (not
shown in the figure). The question is seeking a reason
for the unusual goal-ordering of the actor, in
which protect-family is given a higher priority than
p-life.

Figure 3: Associating new questions with XPs. The XP represents a situation in which an agent A volitionally performs
(chooses-to-enter) an action whose outcome is known (knows-result) to be the death-state of A, as well as an unknown
state that A wants more than he wants to avoid his death-state (the goal-ordering). The unknown goal represents
the new question, What could the actor want more than his own life? This is depicted as an empty box, representing a
gap in the program's knowledge. The XP is elaborated by filling in this gap when this question is answered.

Figure 4: Elaborating an XP through incremental learning. The changed portion is depicted as a newly filled-in box,
representing the answering of the question that was indexed at that point (compare with figure 3).

When the elaborated XP is applied to a new suicide
bombing story, the new node will now be one of
the premises of the hypothesis, causing AQUA to ask
whether the actor was trying to protect his family.
This reflects a deeper understanding of this particular
scenario and is shown in figure 4. The new question
will also be instantiated, causing AQUA to look for an
explanation for the unusual goal-ordering. Should
new questions be raised and then answered during future
stories, AQUA will again be able to elaborate this
XP in a similar manner. Thus AQUA evolves a better
understanding of the "blackmailed into suicide bombing"
scenario through a process of question asking and
answering.
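
The question-driven refinement described in this section can be sketched as follows. This is a minimal illustration, assuming an XP is reduced to a list of premises plus a list of pending questions; the `XP` class and its `answer` method are hypothetical names, not part of AQUA.

```python
from dataclasses import dataclass, field

@dataclass
class XP:
    """Hypothetical explanation pattern: known premises plus open questions."""
    name: str
    premises: list = field(default_factory=list)
    questions: list = field(default_factory=list)

    def answer(self, question, node, explained=False):
        # The answered question is replaced by a node in the XP ...
        self.questions.remove(question)
        self.premises.append(node)
        # ... and an unexplained node becomes a new question in turn.
        if not explained:
            self.questions.append(f"Why {node}?")

xp = XP("xp-blackmail-suicide-bombing",
        premises=["knows-result(death-state)", "chooses-to-enter(action)"],
        questions=["What could the actor want more than his own life?"])

xp.answer("What could the actor want more than his own life?",
          "protect-family over p-life")
print(xp.premises[-1])
print(xp.questions)
```

Each call to `answer` fills one gap and may open another, which mirrors the incremental character of the process: the XP is never rebuilt, only elaborated.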

7  Conclusions 

Explanation patterns are used for constructing
explanations for anomalous situations by applying
stereotypical packages of causality from similar situations
encountered earlier. Thus XPs are abstract cases that
are used as paradigmatic examples of stereotypical
situations.

This paper presents a theory of XP learning through
the incremental modification of existing XPs, using
explanation-based learning techniques to constrain the
modification process. The modifications involve the
adaptation and elaboration of XPs, as well as the
learning of indices for XPs. Both types of knowledge
are essential in any case-based reasoning system. The
theory is implemented in the AQUA program, which
learns about terrorism by reading newspaper stories
about unusual terrorist incidents in the Middle East.

Abstract

This paper describes the results obtained
in applying the learning system ENIGMA to a
fault diagnosis problem of electromechanical
devices at ENICHEM (Ravenna, Italy). The
system ENIGMA is capable of learning
structured knowledge from examples and a
domain theory, using an integrated
inductive/deductive paradigm.

The  results  are  compared  with  the  ones 
obtained  by  an  expert  system,  designed  for  the 
same  task,  in  which  the  knowledge  base  was 
acquired  using  the  traditional  method  of  expert 
interview.  The  comparison  indicates  that 
performances  obtained  by  the  learning  system 
are  systematically  better  than  the  ones 
obtained  by  the  manually  developed  expert 
system.  The  conclusion  is  that,  even  if  still 
leaving  room  for  improvements,  automated 
learning  is  a  viable  approach  to  the 
construction  of  expert  systems,  from  the  point 
of  view  of  both  obtainable  performance  and  of 
limiting  the  development  time  and  cost. 

1.  Introduction 

It is widely recognized that the feasibility of
expert systems exhibiting human-like performance
strongly depends upon the possibility of developing
mechanisms for automating the processes of knowledge
acquisition and maintenance. In the last decade, a
number of research projects have been devoted to
machine learning and knowledge acquisition. Although
much progress has been made, especially on the
problem of learning concept descriptions from
examples, we still have to recognize that the problem
is, in general, extremely hard. This is confirmed by the
fact that a number of learning systems have been
described in the literature, but very few real applications
have been addressed in which machine learning proved
capable of generating a knowledge base with the same
(or better) performance as the one constructed by a
human expert. We mention, in this sense, the results
obtained by Michalski in developing automatic
classification systems for agricultural [7] and medical
[8] applications. In domains where the learning events
can be represented as vectors of <attribute, value>
pairs, interesting results have been obtained using
decision trees [11,12].

Many of the above mentioned results in
classification systems have been obtained using mainly
inductive methods; however, an important requirement,
which characterizes many applications, is that the
learned rules be understandable in the light of a
pre-existing knowledge of the domain; this is particularly
true for diagnostic systems.

This paper describes the work done, and the
results obtained, in a pilot project aimed at checking
the real possibilities offered by the state of the art in
Machine Learning for automating the process of
knowledge base construction for an expert system
oriented to electromechanical troubleshooting. To this
end, the prototype system ENIGMA, based on an
integrated inductive/deductive paradigm [4], has been
developed and extensive experimentation has been
performed, the most important aspects of which are
described in the following. The performance of the
knowledge base acquired by ENIGMA has been compared
with that of MEPS, a rule-based expert system whose
knowledge was manually acquired by interviewing the
domain expert [5]. However, performance is not the only
useful parameter for the comparison: knowledge
understandability and meaningfulness, and development
time, have also been taken into consideration. The results
were considered encouraging enough to justify
substantial funding, by ENI, for a new project aimed at
developing an industrial version of this learning
system.

2.  The  Learning  Problem 

The case study has been supplied by the
Enichem-Anic chemical plant at Ravenna, Italy. In this
plant a technique of predictive maintenance is applied to
a large set of apparata including motor-pumps,
turbo-alternators and ventilators.

All these apparata share the common feature of
possessing a rotating shaft to which various rotors are
connected. When a machine possesses rotating
elements, several unavoidable vibratory motions are
induced in its parts; these vibrations also occur during
correct machine operation and are not dangerous as
long as their amplitude remains limited. When some
fault occurs in the machine, new, anomalous vibrations
appear, besides other manifestations. The aim of
predictive maintenance is to locate failures (still in the
initial stage) and to diagnose their severity, through an
analysis of these vibrations which is called
mechanalysis. Mechanalysis basically performs a
Fourier analysis of the vibratory motions taken at
prespecified and labelled points, precisely on the
supports of the machine components. By means of a
special analyzer, the technician obtains, for each
support, the amplitude and velocity of the global
vibration along the vertical, horizontal and axial
directions; furthermore, the same data can be taken for
each of the harmonic components of the vibrations.
A qualitative evaluation of the vibration phase can
also be made.
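
To make the idea of reading harmonic components off a vibration signal concrete, here is a minimal sketch of a Fourier analysis on synthetic data. Everything in it is an assumption made for illustration: the sampling rate, the 25 Hz shaft frequency, the amplitudes and the function name `dft_amplitudes` are not from the plant data, and a naive discrete Fourier transform stands in for whatever the analyzer actually does.

```python
import cmath
import math

def dft_amplitudes(samples):
    """Return the amplitude of each frequency bin of a real signal (naive DFT)."""
    n = len(samples)
    amps = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples))
        # Scale so that a sinusoid of amplitude A yields A in its bin.
        scale = 1 if k in (0, n / 2) else 2
        amps.append(scale * abs(s) / n)
    return amps

# Synthetic velocity signal (assumed values): a shaft rotating at 25 Hz
# with a first harmonic of amplitude 2.0 and a second harmonic of 0.5.
rate, n, f0 = 400, 400, 25          # samples/s, number of samples, Hz
signal = [2.0 * math.sin(2 * math.pi * f0 * t / rate)
          + 0.5 * math.sin(2 * math.pi * 2 * f0 * t / rate)
          for t in range(n)]

amps = dft_amplitudes(signal)
bin_hz = rate / n                   # frequency resolution: 1 Hz per bin
print(round(amps[int(f0 / bin_hz)], 3), round(amps[int(2 * f0 / bin_hz)], 3))
```

With one second of data the bins fall on whole hertz, so the two harmonics land exactly on bins 25 and 50 and all other bins stay near zero.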

Mechanalysis has strong mathematical
foundations in vibration theory, and, hence, the
relationships between anomalous frequencies and faults
could be, in principle, predicted. In practice, things are
not so simple, as, usually, many more vibrations than
those predicted occur; this is due to several reasons,
such as mechanical imperfections in the parts, mutual
influence among motions, resonance phenomena and
fault co-occurrence. Moreover, a vibration does not
begin abruptly, but its intensity increases over time,
until a level, deemed to be dangerous, may be reached.

The proposed learning task was that of
learning automatically, from examples of mechanalysis
and with the help of background knowledge, a
knowledge base suitable for deriving diagnoses of the type
produced by human experts. This task can be seen as a
classical case of learning concept descriptions [6], but
it presents many difficulties which are not present in other
learning tasks described in the literature, residing both
in the kind of available data and in the
conceptualization of the problem.

First of all, the examples are complex, each
one consisting, for the motor-pumps, of about 20 to 60
measures taken at different points and conditions of
the machine. Each measure has 2 or 3 attributes (value,
direction and frequency, if appropriate). But, more than
that, the examples are very noisy; in fact all the
measures are affected by large uncertainty margins,
depending both on the intrinsic limits of the
measurement apparata and on the human subjectivity in
recording the observed values.

From the conceptual point of view, the
principal difficulty resides in the fact that the expert's
conclusion mainly arises from a global evaluation of
the mechanalysis measures: a particular frequency
pattern (or value) may acquire great relevance in a given
context and may not be significant in another.
Moreover, only a few of the many measures are
important for a particular diagnosis and the human
expertise consists exactly in knowing how to identify
them. This "globality" characteristic makes it difficult
to define an adequate description language. It turns out,
in fact, that the single mechanalysis measures are too
low level as features and cannot be used directly to
build up a description space. On the contrary, features
of a higher level, defined in terms of groups of items,
have to be introduced in order to describe hypotheses.
This form of "constructive" learning [10] has been
strongly guided, in the current implementation, by the
domain theory.

3.  The  Learning  System 

The  system  ENIGMA  is  basically  an  evolution 
of  an  earlier  version,  the  system  ML-SMART  [2,3], 
enhanced  in  order  to  include  deductive  capabilities  as 
described  in  [4].  In  fact,  several  attempts  to  apply  to  the 
described  case  study  the  former  version  of  ML- 
SMART,  which  was  a  purely  inductive  system, 
generated  knowledge  that  was  very  difficult  to 
understand  in  the  light  of  the  existing  domain  theory. 

We will not describe the system ENIGMA here,
since a detailed description is already available in [1,3,5],
but we will mention some points necessary to
understand how the system has been applied.

ENIGMA receives in input a set of learning
events and a body of background knowledge described as
a Horn theory and produces in output a structured
knowledge base of classification rules. The peculiarity
that makes this system suitable to deal with structured
domains is that the learning events are described as
vectors of items. Each item is in turn a vector of
<attribute, value> pairs and corresponds to a part
(subpattern) of a concept instance.

In  the  present  case  the  learning  events 
correspond  to  the  mechanalysis  data,  obtained 
through  Fourier  analysis,  which  are  collected  in  a 
table  of  the  type  described  in  Fig.  1. 

Inside the table the data are arranged into groups
of three rows; each group corresponds to a given
support and each row to one spatial direction
(Horizontal, Vertical and Axial). A first group of two
columns (denoted by "Total Vibration") contains the
measures of amplitude and velocity of the total
vibration, whereas a second group (denoted by "Fourier
Analysis") contains the measures of frequency, velocity
and (possibly) phase of the harmonic components of the
vibration. Notice that, for the harmonics, the measure
of the velocity v (for which more reliable analyzers
exist) allows the amplitude to be known as well, since
the latter is proportional to v through the (known)
frequency $\omega$. The
measure consists, in some cases, of a single value (the
index of the analyzer is stable), whereas, in others, of a
range (the analog index of the analyzer oscillates
between two extremes). This distinction is an important
factor for the differential diagnosis. The qualitative
behavior (stable, unstable, oscillating, rotating, fixed)
of the vibration phase, when observed, is denoted by a
letter attached to the corresponding frequency value. For
instance, the "i" occurring in the figure denotes an
unstable phase.
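
The remark that the amplitude of a harmonic can be recovered from its velocity follows from the form of a single sinusoidal component. As a brief worked step, assuming a purely sinusoidal harmonic (the symbols $x(t)$ and $A$ are introduced here only for illustration, and $\omega$ is the angular frequency of the harmonic):

```latex
x(t) = A \sin(\omega t), \qquad
\dot{x}(t) = \omega A \cos(\omega t)
\;\Longrightarrow\; v = \omega A, \qquad A = \frac{v}{\omega}.
```

Here $v$ is the peak velocity, so a velocity reading at a known frequency fixes the displacement amplitude.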

Fig. 1: Organization of the data collected during a mechanalysis.


A mechanalysis table is described to ENIGMA
by supplying an item for each non-empty entry. Each
item is described by a vector of attributes characterizing
the support, the direction, the amplitude, the type (total
vibration or Fourier Analysis), the value of the
measure, the "normal" value, and the rotation speed of
the shaft. These attributes correspond to what in
Explanation Based Learning are called "operational"
predicates (elementary features). Other predicates (higher
level features) can be defined using a Horn theory. In
particular, it is possible to let ENIGMA work by
applying pure EBG [9], if a complete theory, defining a
non-operational description of the concept, is given.
However, this is difficult to achieve in practice, and
ENIGMA was provided with a theory that only defines
high level features capturing contextual information.

The rules learned by ENIGMA take the general
form:

$$ r:\quad H_j;\ \varphi \;\xrightarrow{\,w\,}\; H_k \vee H_l \qquad (1) $$

Rule r can be interpreted as follows. Suppose that $H_0$ is
the set of all classes and $h \in H_0$ is the concept to be
identified in a given event f. Suppose, moreover, that,
owing to some reasoning made by using other rules (of
this type), we arrived at supposing that h can only
belong to the subset $H_j \subset H_0$; then, if the assertion $\varphi$
is verified on f, the rule (1) concludes that h belongs to
$H_k$ with probability w or to $H_l$ with probability 1-w.
The relations $H_k \cap H_l = \emptyset$ and $H_j \supset H_k, H_l$ always
hold. $H_j$ is called the context of the rule, $H_k$ the
primary implication and $H_l$ the secondary implication.
If the probability is w=1, the secondary implication is
not present. As a special case, the set $H_k$ may consist
of a single concept h.
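
The reading of rule (1) given above can be illustrated with a small sketch. It is not ENIGMA's rule representation: the `Rule` class, its field names, and the toy event with a `high_harmonics` count are all hypothetical, chosen only to show how context, assertion, primary and secondary implication fit together.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional

@dataclass(frozen=True)
class Rule:
    """Hypothetical encoding of a rule  r: H_j; phi --w--> H_k v H_l."""
    context: FrozenSet[str]              # H_j
    assertion: Callable[[dict], bool]    # phi, evaluated on an event f
    primary: FrozenSet[str]              # H_k
    secondary: FrozenSet[str]            # H_l (empty when w == 1)
    w: float

    def __post_init__(self):
        # The constraints stated in the text: H_k and H_l are disjoint
        # and both are contained in the context H_j.
        assert not (self.primary & self.secondary)
        assert self.primary <= self.context and self.secondary <= self.context

    def apply(self, candidates: FrozenSet[str], event: dict) -> Optional[dict]:
        # The rule fires only if earlier reasoning has narrowed the
        # candidates down to the context and phi holds on the event.
        if not candidates <= self.context or not self.assertion(event):
            return None
        result = {self.primary: self.w}
        if self.secondary:
            result[self.secondary] = 1.0 - self.w
        return result

rule = Rule(context=frozenset({"C1", "C2", "C4"}),
            assertion=lambda f: f["high_harmonics"] >= 2,
            primary=frozenset({"C1", "C4"}),
            secondary=frozenset({"C2"}),
            w=0.75)
out = rule.apply(frozenset({"C1", "C2", "C4"}), {"high_harmonics": 3})
print(out)
```

Because the primary implication of one rule can serve as the context of another, chaining such `apply` calls narrows the candidate set step by step.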

The assertion $\varphi$ is a first order logic formula,
expressed with operational predicates only. Numerical
quantifiers such as Atleast n, Atmost n, Exactly n and
negation are also allowed. Usually the primary
implication of a rule coincides with the context of
another formula. Formulas having the same primary
implication are implicitly considered as OR-ed. As a
consequence, the knowledge base learned by ENIGMA
can be described as a graph of rules. ENIGMA produces
such a graph by searching in the rule space using a
general-to-specific strategy, starting from the most
general formula $\mathit{true} \rightarrow H_0$ and generating more and more
specific formulas until classification rules are
discovered. This process is guided by statistical
heuristics which trade off consistency and completeness
criteria and by the background knowledge supplied at the
beginning. The strategies and the heuristics are described
in [2,3,4,5].

4.  The  Learning  Set 

All the experiments have been performed using
a set $F_0$ of N=209 mechanalysis tables (examples),
filled in by an experienced domain expert and referring to
diagnoses of motor-pumps.

The considered faults can be grouped into six
classes:

C1 = Problems in the joint
C2 = Faulty bearings
C3 = Mechanical loosening
C4 = Basement distortion
C5 = Unbalance
C6 = Normal operation conditions

However, as mentioned in Section 2, these
faults rarely occur in isolation and, even then, it is not
always possible to identify them precisely. Thus,
not all the N examples have been uniquely classified
by the human expert; on the contrary, the
diagnoses he generated followed the taxonomy
reported in Fig. 2. According to this diagnostic
taxonomy, the following intermediate classes have been
used by the expert:

C7 = Shaft misalignment (C7 = C1 ∪ C4)
C8 = Problems in the pump (C8 = C2 ∪ C3 ∪ C5)
C9 = Problems in the motor (C9 = C2 ∪ C3 ∪ C5)
C10 = Problems in the machine (C10 = C8 ∪ C9)

The ambiguity a(f) of a classified example f is
the minimum number of classes f is hypothesized to
belong to; for instance, an example f classified in class
C8 has an ambiguity a(f) = 3. Moreover, a diagnosis
$H_i$ will be said to be more specific than a diagnosis $H_j$ iff $H_j$
is an ancestor of $H_i$ in the diagnostic taxonomy.


Fig.  2  -  Diagnostic  taxonomy  of  the  motor-pump  faults 


Table I

Diagnoses generated by the expert. The classes correspond to the taxonomy in Fig. 2

Obviously, we desire that the automated system
be at least as specific as the expert was. The ambiguity
parameter a roughly corresponds to the amount of
effort required to identify exactly the cause of a
single fault. In fact, the higher a, the greater the
number of components that need to be examined. For
example, if faulty bearings in the pump are diagnosed,
then only the bearings have to be disassembled; if,
instead, only problems in the pump can be
hypothesized, then the whole pump has to be
dismounted.

In Table I the expert's classification of the N
examples is reported; the average ambiguity of the
examples is a = 2.08. Notice that, whereas an internal
node of the diagnostic taxonomy denotes uncertainty
about the right choice among the descendant nodes, the
intersection Ci ∩ Cj denotes co-occurrence of both Ci
and Cj: for this reason, the ambiguity of the diagnosis
Ci ∩ Cj is evaluated as the minimum between those
assigned to Ci and Cj. Regarding the expert's
classification, it has to be pointed out that, in many
cases, the expert was actually able to generate a less
ambiguous hypothesis than the one reported in Table I.
However, he judged this more specific diagnosis as out
of reach for a system lacking a deeper
understanding of the domain; he therefore indicated what, in
his opinion, was an acceptable answer for a prototype
automated system. Moreover, the generation of a more
precise diagnosis, on the part of the expert, entailed an
error rate of about 5% (according to the expert's
subjective estimate), whereas the diagnoses reported in
Table I are the most specific ones that are still error
free. In this context, we say that an error occurred when
the true class is not included in the set of the proposed
ones.
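
The ambiguity measure can be made concrete with a short sketch. It assumes each diagnosis is represented by the set of elementary fault classes it covers, and it models "more specific" as proper set inclusion; both choices, the names `LEAVES`, `ambiguity` and `more_specific`, and the four example labels are assumptions for illustration, not the expert's data.

```python
# Elementary fault classes covered by each diagnosis (from Section 4).
LEAVES = {
    "C1": {"C1"}, "C2": {"C2"}, "C3": {"C3"}, "C4": {"C4"}, "C5": {"C5"},
    "C7": {"C1", "C4"},            # shaft misalignment
    "C8": {"C2", "C3", "C5"},      # problems in the pump
}

def ambiguity(diagnosis):
    """Ambiguity of a diagnosis given as a class name or as a tuple of
    co-occurring class names (an intersection in the paper's notation)."""
    if isinstance(diagnosis, tuple):
        # Co-occurrence: the minimum of the component ambiguities.
        return min(ambiguity(d) for d in diagnosis)
    return len(LEAVES[diagnosis])

def more_specific(a, b):
    """True if diagnosis a is a strict descendant of diagnosis b."""
    return LEAVES[a] < LEAVES[b]

examples = ["C2", "C7", "C8", ("C7", "C2")]   # hypothetical expert labels
values = [ambiguity(e) for e in examples]
print(values, sum(values) / len(values))
```

Averaging `ambiguity` over a labelled set gives the kind of figure reported for the expert's classification, although the numbers here come from made-up labels.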

5.  Results  with  the  Expert  System 
MEPS 

MEPS is a prototype expert system [6]
developed manually by means of interviews with the
same domain expert who supplied the classified
examples. MEPS' knowledge is represented by means of
rules and frames and contains both diagnostic and
structural information. The chosen implementation
environment is the GOLD-WORKS shell on an IBM
Personal Computer. The system contains about 290
diagnostic rules and 70 structural frames. Its
representation language is a first order logic based
language with an associated continuous-valued
semantics. The process of designing and implementing
the MEPS prototype took about 18 months, 12 of
which were devoted to the acquisition, encoding and
maintenance of the knowledge base.

In Table II the results obtained from the expert
system MEPS are reported. MEPS performs, on a given
case, an evidential reasoning and generates, as a result,
a list of possible faults, ordered according to decreasing
value of evidence. The recognition rate has been
evaluated in two ways, a pessimistic and an optimistic
one.


Table II

Results of MEPS on the examples of the set $F_0$

Ambiguity   Number of Cases   The best hypothesis    The correct hypothesis was
                              was the correct one    included in the proposed set
    1             131                 122                       122
    2              60                  49                        59
    3              18                  15                        18

In the pessimistic case, only the best scored hypothesis
$h_{best}$ has been considered; if $h_{best}$ is a descendant, in
the diagnostic taxonomy, of the node corresponding to
the diagnosis given by the expert and is an ancestor of
the correct diagnosis, then $h_{best}$ is considered correct
(the system gives an answer of equal or higher
specificity than that of the expert and contains the
correct class). In the optimistic case, MEPS' diagnosis
is considered correct if at least one hypothesis with the
preceding characteristics is included in the set of
generated hypotheses. The obtained average ambiguity
is a = 1.46, the pessimistic recognition rate is $\eta$ = 0.86 and the
optimistic recognition rate is $\eta^*$ = 0.95.
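
The two evaluation criteria can be sketched as follows. The sketch assumes a set-based reading of the taxonomy (a hypothesis is acceptable if it is contained in the expert's diagnosis and still contains the true class); the function names and the four cases are hypothetical and do not reproduce the MEPS results.

```python
def acceptable(hypothesis, expert, true_class):
    """Set reading of the correctness test: the hypothesis is at least as
    specific as the expert's diagnosis and still contains the true class."""
    return true_class in hypothesis and hypothesis <= expert

def rates(cases):
    """cases: list of (ranked hypotheses, expert diagnosis, true class)."""
    pess = sum(acceptable(h[0], e, t) for h, e, t in cases)
    opt = sum(any(acceptable(x, e, t) for x in h) for h, e, t in cases)
    return pess / len(cases), opt / len(cases)

S = frozenset
cases = [  # hypothetical cases, for illustration only
    ([S({"C2"}), S({"C3"})], S({"C2", "C3", "C5"}), "C2"),
    ([S({"C1"}), S({"C4"})], S({"C1", "C4"}), "C4"),
    ([S({"C5"})], S({"C3"}), "C3"),
    ([S({"C1", "C4"})], S({"C1", "C4"}), "C1"),
]
print(rates(cases))
```

The pessimistic rate looks only at the first element of each ranked list, so it can never exceed the optimistic one.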

6.  Results  Obtained  Using  ENIGMA 

The learning experiments with ENIGMA were
performed exploiting the incremental abilities of the
system and consisted of a sequence of six runs (phases).

In the first phase, 60 cases, randomly chosen,
were used as learning set (LS) and the remaining 149
examples as test set (TS). Afterwards, 20 examples,


Integrated  Learning  in  a  Real  Domain  327 


randomly selected among the 149, were used to update
the current knowledge base and the 129 left over
acted as test set. The error rate on the (60+20)
training examples has also been computed. Notice that the
previously used 60 examples are test examples for the
knowledge acquired with the 20 following ones. The
whole process has been repeated 5 times, by randomly
choosing the examples to be added in each phase, but
keeping their number fixed, and the average results are
reported in Table III. As an example, we report here one
of the rules learned by ENIGMA:

"If the shaft rotating frequency is $\omega_0$ and the harmonic
at $\omega_0$ is reported to have high intensity and the
harmonic at $2\omega_0$ is reported to have high intensity in at
least two measurements, then the example is an
instance of one of the classes C1, C4 or C8".
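
A rule of this kind, with its numerical quantifier, can be sketched over the item representation described in Section 3. The sketch is only illustrative: the dictionary keys, the threshold `high` and the sample event are assumptions, and the helper `atleast` is a hypothetical stand-in for the "Atleast n" quantifier.

```python
def atleast(n, items, predicate):
    """Numerical quantifier: the predicate holds for at least n items."""
    return sum(1 for item in items if predicate(item)) >= n

def learned_rule(event, w0, high=1.0):
    """Sketch of the quoted rule; `high` is an assumed intensity threshold."""
    fourier = [i for i in event if i["type"] == "fourier"]
    first = atleast(1, fourier,
                    lambda i: i["frequency"] == w0 and i["value"] > high)
    second = atleast(2, fourier,
                     lambda i: i["frequency"] == 2 * w0 and i["value"] > high)
    return {"C1", "C4", "C8"} if first and second else None

# Hypothetical event: one item per non-empty entry of a mechanalysis table.
event = [
    {"support": "A", "direction": "H", "type": "fourier", "frequency": 50, "value": 3.2},
    {"support": "A", "direction": "V", "type": "fourier", "frequency": 100, "value": 2.1},
    {"support": "B", "direction": "A", "type": "fourier", "frequency": 100, "value": 1.8},
    {"support": "B", "direction": "H", "type": "total", "frequency": None, "value": 4.0},
]
print(learned_rule(event, w0=50))
```

The rule does not pick a single fault; it only narrows the diagnosis to a set of classes, leaving further rules to discriminate among them.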


Table III

Results of the automatic system ENIGMA with incremental learning

                                        Phase 1  Phase 2  Phase 3  Phase 4  Phase 5  Phase 6
Number of added training examples          60       20       24       21       20       22
Cardinality of the Test Set (TS)          149      129      105       84       64       42
Recognition rate, Learning Set (LS)      0.97     0.99     0.96     0.98     0.97     0.96
Recognition rate, Test Set               0.93     0.94     0.93     0.9      0.9      0.86
Recognition rate, complete set           0.94     0.95     0.93     0.95     0.94     0.94
Ambiguity                                1.15     1.24     1.18     1.18     1.23     1.19
Number of rules in the Knowledge Base      39       70      128      131      142      147
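
The test-set cardinalities in Table III follow directly from the incremental protocol: every example not yet used for learning is used for testing. A short check of that arithmetic, using only the numbers from the table (the variable names are illustrative):

```python
N = 209                                  # size of the example set
added = [60, 20, 24, 21, 20, 22]         # examples added in phases 1-6

learning_set = 0
test_sizes = []
for batch in added:
    learning_set += batch                # the learning set only grows
    test_sizes.append(N - learning_set)  # the rest is used for testing

print(test_sizes)
```

By the last phase only 42 examples remain for testing, so the test-set recognition rate of that phase rests on a much smaller sample than that of the first.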

7.  Discussion 

An interesting comment about the results of the
automatic learning is that the performance remains quite
stable as the number of seen examples varies and
eventually degrades slightly. This counter-intuitive
phenomenon is due to the fact that, because of the
ambiguity inherent in the examples, adding new ones to
the training set does not contribute any new
information but, on the contrary, mixes up information
from different faults. In fact, denoting by $a_j$ the
average ambiguity of the learning examples at the
j-th phase, we notice from Table IV an increasing
confusion of this set, which is responsible for the above
mentioned phenomenon.


Table IV

Ambiguity of the learning set in each learning phase. The last column refers to examples of classes C1-C5.

          Ambiguity in the     Ambiguity in the      Ambiguity of the examples
          added Learning Set   global Learning Set   of classes C1-C5
Phase 1         1.98                 1.98                    2.18
Phase 2         2.23                 2.04                    2.24
Phase 3         2.36                 2.11                    2.31
Phase 4         2.23                 2.13                    2.33
Phase 5         2.85                 2.23                    2.43
Phase 6         2.36                 2.25                    2.44


Table V

Comparison between ENIGMA and MEPS

          Ambiguity   Recognition rate   Recognition rate   Development
                      on complete set    on test set        time
ENIGMA      1.21           0.95               0.94          18 months
MEPS        1.46           0.95                -            4 months

More precisely, the main confusion arises among the five fault
classes, whereas the examples in class C5 (non faulty machine) are
always good examples; in fact, both the expert system MEPS and the
knowledge generated by ENIGMA never confuse faulty and non-faulty
machines. As it is not possible to select "cleaner" examples than
those used in these experiments, the only way to further increase
the KB performance seems to be that of supplying both MEPS and
ENIGMA with a deep model of the domain, allowing more complex, but
more subtle, reasoning to be performed, as the human expert does.
The evolution of the approach in this direction is under
development, and encouraging preliminary results have already been
obtained [5,13,14].

A second point to be investigated is the comparison between the
knowledge base acquired by ENIGMA and that of MEPS. Some parameters
useful to this aim are reported in Table V.

For ENIGMA, the knowledge base acquired during the second phase has
actually been retained and used. As one can see from Table V, the
performances of the two systems are comparable, but the
automatically acquired knowledge needed much less time and effort
to build. Notice that, even if we must in general consider the
recognition rate on the test set as the performance measure, it is
fairer in this case to compare the two systems on the basis of the
recognition rate on the complete set of 209 examples. Even though
this discussion could seem pointless, given the substantial identity
of the two values (0.94 versus 0.95) and the fact that in the second
phase the learning events are only 80 out of 209, it is interesting
from a methodological point of view. In fact, the 209 cases
considered (spanning a time period of about 25 years) are the very
source of the knowledge of the human expert who supplied the MEPS
rules. He did not have a teacher, nor was there expertise available
in advance, and he formed his expertise by handling this set of
cases (according to his own statement). Thus, the MEPS knowledge
base is in fact tested on its own learning set. It is also
interesting to notice that about 1/3 of the rules acquired by ENIGMA
coincided with corresponding rules in MEPS.


As for the development time, an initial phase of problem mastering
was common to both projects (about 2 months), and a further month
was spent in preparing and memorising the data. Afterward, the
manual knowledge acquisition and updating lasted 12 months, whereas
ENIGMA acquired the knowledge base in a few hours.

To make the acquired knowledge more understandable to the expert, a
domain theory, defining higher level features, has been given to the
system. An example of a rule in this domain theory is the following:

"If the shaft rotating frequency is ω0 and a vibration has
frequency ω and ω is a multiple of ω0, then ω is a HARMONIC of ω0"

The process of defining and implementing this theory took about one
more month.

8.  Conclusions

In this paper we described an application of the learning system
ENIGMA to a real problem of mechanical troubleshooting. The problem
was a difficult one, due to the complexity and high degree of noise
of the data and to the effort required for choosing a suitable
description language and a problem conceptualization.

The results obtained indicate that it is becoming realistic to apply
machine learning techniques to significant problems, greatly
reducing the development time of expert systems while keeping a
performance level comparable with that of expert systems developed
with classical methodologies.

Abstract


This paper describes a generalization of Mitchell’s version-space
approach to concept learning that significantly extends its range of
applicability. The key idea is to drop the requirement that a
version space be the set of concepts strictly consistent with
training data, and to allow arbitrary sets of concepts, however
generated, as long as they can be represented by boundary sets.
Learning is accomplished with version-space intersection, rather
than the traditional candidate-elimination algorithm. Applications
of the learning method, incremental version-space merging, include
learning from forms of inconsistent data and combining empirical and
analytical learning.


1  Introduction 


Concept learning can be viewed as a problem of search [Simon and
Lea, 1974; Mitchell, 1978; 1982]: the task is to identify some
concept definition out of a space of possible definitions. Mitchell
[1978] formalizes this view of generalization as search in his
development of version spaces. He defines a version space to be the
set of all concept definitions in a prespecified language that
correctly classify training data (the positive and negative examples
of the unknown concept). Although a landmark work, it was limited in
its underlying assumption that the desired concept definition will
be strictly consistent with all the given data. This work
generalizes the version-space approach to concept learning to
partially overcome this assumption.

The paper begins with a brief review of the version-space approach
to concept learning, followed by an overview of the generalized
approach. Summaries of two different applications of the resulting
learning method are then presented: emulating and extending the
candidate-elimination algorithm, and combining empirical and
analytical learning. A third application, learning from data with
bounded inconsistency, is presented elsewhere in these proceedings
[Hirsh, 1990b]. The paper concludes with a discussion of
computational issues for the approach.


2  Version  Spaces  and  the  Candidate-Elimination  Algorithm

Given a set of training data and a language in which the desired
concept must be expressed (which defines the space of possible
generalizations concept learning will search), Mitchell [1978]
defines a version space to be “the set of all concept descriptions
within the given language which are consistent with those training
instances”. Mitchell noted that the generality of concepts imposes a
partial order that allows efficient representation of the version
space by the boundary sets S and G representing the most specific
and most general concept definitions in the space. The S- and G-sets
delimit the set of all concept definitions consistent with the given
data: the version space contains all concepts as or more general
than some element in S and as or more specific than some element
in G.

Given a new instance, some of the concept definitions in the version
space for past data may no longer be consistent with the new
instance. The candidate-elimination algorithm manipulates the
boundary-set representation of a version space to create boundary
sets that represent a new version space consistent with all the
previous instances plus the new one. For a positive example the
algorithm generalizes the elements of the S-set as little as
possible so that they cover the new instance yet remain consistent
with past data, and removes those elements of the G-set that do not
cover the new instance. For a negative instance the algorithm
specializes elements of the G-set so that they no longer cover the
new instance yet remain consistent with past data, and removes from
the S-set those elements that mistakenly cover the new, negative
instance. The unknown concept is determined when the version space
has only one element, which in the boundary-set representation is
when the S- and G-sets have the same single element.

To demonstrate the candidate-elimination algorithm, consider a robot
manipulating objects on an assembly line. Occasionally it is unable
to grasp an object. The learning task is to form rules that will
allow the robot to predict when an object is graspable. To make the
example simple, the only features of objects that the robot can
identify, and hence the only features that may appear in the learned
rules, are shape and size. An object may be shaped like a cube,
pyramid, octahedron, or sphere, and may have large or small size.
Further structure to the shape attribute may be imposed by including
in the robot’s vocabulary the term “polyhedron” for cubes, pyramids,
and octahedra. The generalization hierarchies that result are shown
in Figure 1. Concept definitions take the form

    Size(X, small) ∧ Shape(X, polyhedron) → Graspable(X),

which will be abbreviated to “[small, polyhedron].” The language is
assumed to be sufficient for expressing the desired concept, and the
data are assumed to be consistent.

Figure 1: Generalization Hierarchies.

The first object the robot tests is graspable, and is thus a
positive example of the target concept. It is a small cube, and
hence the initial version space has boundary sets S={[small, cube]}
and G={[any-size, any-shape]}. The second object on the assembly
line cannot be grasped, so is a negative instance. It is a small
sphere, yielding new boundary sets S={[small, cube]} and
G={[any-size, polyhedron]}: the only way to specialize the G-set to
exclude the new instance but still cover the S-set element is to
move down the generalization hierarchy for shape from any-shape to
polyhedron. A further negative instance, a large octahedron, prunes
the version space yet more, to S={[small, cube]} and G={[any-size,
cube]; [small, polyhedron]}. The new G-set now has two elements
since there are two ways to specialize the old G-set to exclude the
new instance but still cover the S-set element. After a final,
positive instance that is a small pyramid, the boundary sets become
S=G={[small, polyhedron]}, yielding the final generalization that
all small polyhedral objects are graspable.
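
As a rough illustration of how boundary sets delimit a version space, the following Python sketch tests membership using only S and G, for the boundary sets reached after the third instance of the robot example. The encoding of the Figure 1 hierarchies as a child-to-parent table, the tuple representation of concepts, and all function names are assumptions made for this sketch; they are not taken from the paper.

```python
from itertools import product

# Assumed encoding of the generalization hierarchies (child -> parent).
PARENT = {
    "small": "any-size", "large": "any-size",
    "polyhedron": "any-shape", "sphere": "any-shape",
    "cube": "polyhedron", "pyramid": "polyhedron", "octahedron": "polyhedron",
}
SIZES = ["any-size", "small", "large"]
SHAPES = ["any-shape", "polyhedron", "sphere", "cube", "pyramid", "octahedron"]


def ancestors(value):
    """Return the value followed by its ancestors, most specific first."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def more_general(c1, c2):
    """True if concept c1 is as general as or more general than c2."""
    return all(a in ancestors(b) for a, b in zip(c1, c2))


def in_version_space(concept, s_set, g_set):
    """Membership test that consults only the boundary sets."""
    return (any(more_general(concept, s) for s in s_set)
            and any(more_general(g, concept) for g in g_set))


# Boundary sets after the third instance of the robot example.
S = {("small", "cube")}
G = {("any-size", "cube"), ("small", "polyhedron")}

members = sorted(c for c in product(SIZES, SHAPES) if in_version_space(c, S, G))
print(members)
```

Under these assumptions the version space holds three concept definitions: [any-size, cube], [small, cube], and [small, polyhedron].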

3  Incremental  Version-Space  Merging 

There were two principal insights in Mitchell’s work. The first was
to consider and keep track of a set of candidate concept
definitions, rather than keeping a single definition deemed best
thus far. The second insight was that the set of all concept
definitions need not be explicitly enumerated and maintained, but
rather the partial ordering on concepts could be exploited to
provide an efficient means of representation for the space of
concept definitions. The key idea in this work is to maintain
Mitchell’s two insights, but remove its assumption of strict
consistency with training data: a version space is generalized to be
any set of concept definitions in a concept description language
representable by boundary sets.


Given the generalized definition of version spaces, the
candidate-elimination algorithm can no longer be used, since it,
too, assumes strict consistency with data. Thus an alternative
incremental learning method was developed. Rather than basing the
learning algorithm on shrinking the version space, the new algorithm
is instead based on version-space intersection. Given a version
space based on one set of information, and another based on a second
set of information, the intersection of the two version spaces
reflects the union of the sets of information. Such version-space
intersection forms the basis for the incremental learning method,
called incremental version-space merging, developed as part of this
work.

The algorithm for computing the intersection of two version spaces
is called the version-space merging algorithm. It computes the
intersection using only boundary-set representations. This is
pictured in Figure 2. Given version space VS1 with boundary sets S1
and G1, and VS2 with boundary sets S2 and G2, the version-space
merging algorithm finds the boundary sets S1∩2 and G1∩2 for their
intersection, VS1 ∩ VS2 (labeled VS1∩2). It does so in a two-step
process. The first step assigns the set of minimal generalizations
of pairs from S1 and S2 to S1∩2, and assigns the set of maximal
specializations of pairs from G1 and G2 to G1∩2. The second step
removes overly general elements from S1∩2 and overly specific
elements from G1∩2. In more detail:*

1.  For each pair of elements s1 in S1 and s2 in S2 generate their
most specific common generalizations. Assign to S1∩2 the union of
all such most specific common generalizations of pairs of elements
from the two original S-sets. Similarly, generate the set of all
most general common specializations of elements of the two G-sets G1
and G2 for the new G-set G1∩2.

2.  Remove from S1∩2 those elements that are not more specific than
some element from G1 and some element from G2. Also remove those
elements more general than some other element of S1∩2 (generated
from a different pair from S1 and S2). Similarly remove from G1∩2
those elements that are not more general than some element from each
of S1 and S2, as well as those more specific than any other element
of G1∩2.
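
The two steps above can be sketched in Python for the robot language of Section 2. This is a minimal sketch under stated assumptions: concepts are (size, shape) tuples, `None` stands for the empty concept, and the hierarchies are trees, so each pair of concepts has a single most specific common generalization and at most one non-empty most general common specialization. The helper names `mscg` and `mgcs` are hypothetical. In a general concept language both operations may return several elements.

```python
# Assumed encoding of the generalization hierarchies (child -> parent).
PARENT = {
    "small": "any-size", "large": "any-size",
    "polyhedron": "any-shape", "sphere": "any-shape",
    "cube": "polyhedron", "pyramid": "polyhedron", "octahedron": "polyhedron",
}


def ancestors(value):
    """Return the value followed by its ancestors, most specific first."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def more_general(c1, c2):
    """True if c1 is as or more general than c2 (None is the empty concept)."""
    if c2 is None:
        return True
    if c1 is None:
        return False
    return all(a in ancestors(b) for a, b in zip(c1, c2))


def mscg(c1, c2):
    """Most specific common generalization (unique in tree hierarchies)."""
    if c1 is None:
        return c2
    if c2 is None:
        return c1
    return tuple(next(x for x in ancestors(a) if x in ancestors(b))
                 for a, b in zip(c1, c2))


def mgcs(c1, c2):
    """Most general common specialization; None if the concepts are disjoint."""
    if c1 is None or c2 is None:
        return None
    out = []
    for a, b in zip(c1, c2):
        if a in ancestors(b):
            out.append(b)
        elif b in ancestors(a):
            out.append(a)
        else:
            return None
    return tuple(out)


def merge(vs1, vs2):
    """Intersect two version spaces given as (S-set, G-set) pairs."""
    (s1, g1), (s2, g2) = vs1, vs2
    # Step 1: pairwise generalizations and specializations.
    s_new = {mscg(a, b) for a in s1 for b in s2}
    g_new = {mgcs(a, b) for a in g1 for b in g2}
    # Step 2: prune elements outside the other boundary, then non-extremal ones.
    s_new = {s for s in s_new
             if any(more_general(g, s) for g in g1)
             and any(more_general(g, s) for g in g2)}
    s_new = {s for s in s_new
             if not any(t != s and more_general(s, t) for t in s_new)}
    g_new = {g for g in g_new
             if any(more_general(g, s) for s in s1)
             and any(more_general(g, s) for s in s2)}
    g_new = {g for g in g_new
             if not any(h != g and more_general(h, g) for h in g_new)}
    return s_new, g_new


vs_past = ({("small", "cube")}, {("any-size", "any-shape")})
vs_new = ({None}, {("large", "any-shape"), ("any-size", "polyhedron")})
merged = merge(vs_past, vs_new)
print(merged)
```

The usage lines merge the version space for a positive small cube with the one for a negative small sphere; [large, any-shape] is pruned in the second step because it is not more general than [small, cube].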


The only information a user must give for this version-space merging
algorithm to work is information about the concept description
language and the partial order imposed by generality. The user must
specify a method for determining the most general common
specializations and most specific common generalizations of any two
concept definitions. The user must also define the test of whether
one concept definition is more general than another. Given this
information about the concept description language, the two-step
process above will intersect two version spaces, yielding the
boundary-set representation of the intersection.

The result of this change of perspective, from version-space
shrinking to version-space intersection, is that each new constraint
that should reduce the version space of viable concept definitions
must be represented as a version space itself, to be intersected
with the current version space of candidates. When new information
is obtained, incremental version-space merging forms the version
space of concept definitions that are relevant given the new
information and intersects it with the version space for past
information. This is pictured in Figure 3. As each new piece of
information is obtained, its version space is formed (VSi) and
intersected with the version space for past data (VSn) to yield a
new version space (VSn+1), which will itself be intersected with the
version space for the next piece of information.

* Mitchell [1978] provides an equivalent version-space intersection
algorithm.

Figure 2: Version-Space Merging.

The  general  algorithm  proceeds  as  follows: 

1.  Form  the  version  space  for  the  new  piece  of 
information. 

2.  Intersect  this  version  space  with  the  version 
space  generated  from  past  information. 

3.  Return  to  the  first  step  for  the  next  piece  of 
information. 

The initial version space contains all concept descriptions in the
language, and is bounded by the S-set that contains the empty
concept that says nothing is an example, and the G-set that contains
the universal concept that says everything is an example.
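
The three-step loop can be sketched as a small higher-order function in Python. To keep the sketch short, version spaces are represented here as explicitly enumerated sets of concept definitions, which is exactly what the boundary-set representation is meant to avoid; the enumeration, the hierarchy encoding, and all names below are assumptions made for illustration only.

```python
from itertools import product

# Assumed encoding of the generalization hierarchies (child -> parent).
PARENT = {
    "small": "any-size", "large": "any-size",
    "polyhedron": "any-shape", "sphere": "any-shape",
    "cube": "polyhedron", "pyramid": "polyhedron", "octahedron": "polyhedron",
}
SIZES = ["any-size", "small", "large"]
SHAPES = ["any-shape", "polyhedron", "sphere", "cube", "pyramid", "octahedron"]
EMPTY = None  # the empty concept, which says nothing is an example
LANGUAGE = set(product(SIZES, SHAPES)) | {EMPTY}


def ancestors(value):
    """Return the value followed by its ancestors, most specific first."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def covers(concept, instance):
    """True if the concept classifies the instance as positive."""
    if concept is EMPTY:
        return False
    return all(c in ancestors(i) for c, i in zip(concept, instance))


def incremental_merging(pieces, form_version_space, intersect, initial):
    """Generic loop: form a version space per piece, intersect with the past."""
    current = initial
    for piece in pieces:
        new_vs = form_version_space(piece)    # step 1
        current = intersect(current, new_vs)  # step 2
    return current                            # step 3 is the loop itself


def consistent_with(piece):
    """Version space of concepts that classify one labeled instance correctly."""
    instance, positive = piece
    return {c for c in LANGUAGE if covers(c, instance) == positive}


DATA = [
    (("small", "cube"), True),
    (("small", "sphere"), False),
    (("large", "octahedron"), False),
    (("small", "pyramid"), True),
]

final = incremental_merging(DATA, consistent_with, set.intersection, set(LANGUAGE))
print(final)
```

Passing `consistent_with` as the way to form each version space is one particular choice; a different learning task would supply a different function while the loop stays unchanged.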

The insight from which the generality of incremental version-space
merging arises is that the specific learning task should define how
each piece of information is to be interpreted, that is, what the
version space of relevant concept definitions should be. Use of
incremental version-space merging requires a specification of how
the individual version spaces should be formed in the first step for
each iteration. Forming the version space of concept definitions
consistent with each instance (i.e., those concept definitions that
correctly classify the given instance) results in an emulation of
the candidate-elimination algorithm, and permits an extension that
handles cases of ambiguous data (such as when the color attribute of
some instance is known to be either brown or black without knowing
which). Forming the version space containing concept definitions
consistent with the explanation-based generalization of data
provides a way to integrate empirical and analytical learning. These
are the applications discussed in this paper. Further details are
presented elsewhere [Hirsh, 1989b; 1989a; 1990a].

A further application of incremental version-space merging, to learn
from inconsistent data, is presented elsewhere in these proceedings,
and will be briefly summarized here. The approach taken is to forego
a solution to the full problem of learning from inconsistent data,
and instead solve a subcase, called bounded inconsistency. Data are
said to have bounded inconsistency when some small perturbation to
the description of any bad instance will result in a good instance
(such as when misclassifications are due to small measurement
errors). When this is true, a learning system can search through the
space of concept definitions that correctly classify either the
original data, or small perturbations of the data. Instance version
spaces will contain all concept definitions consistent with either
the instance or some neighboring instance description. The resulting
version space after each incremental intersection not only contains
concept definitions that correctly classify all the training data
(if any such definitions exist), but also those that miss some (or
even all) of the data by only a small amount. Further details are
presented elsewhere [Hirsh, 1989b; 1990b].

Figure 3: Incremental Version-Space Merging.


4  The  Candidate-Elimination  Algorithm:  Emulation  and  Extensions

Incremental version-space merging should maintain the functionality
of the original version-space approach. This section demonstrates
how the candidate-elimination algorithm can be emulated using
incremental version-space merging, and furthermore describes an
extension that enables learning from ambiguous data (such as when
features are not uniquely identified).

4.1  Emulating  the  Candidate-Elimination  Algorithm 

The key idea for emulating the candidate-elimination algorithm with
incremental version-space merging is to form the version space of
concept definitions strictly consistent with each individual
instance and incrementally intersect these version spaces with
incremental version-space merging. The results after each
incremental intersection can be shown to be the same as those after
each step of the candidate-elimination algorithm [Hirsh, 1989b].

Emulating the candidate-elimination algorithm with incremental
version-space merging in this manner requires forming the version
space of concept definitions consistent with a single instance in
boundary-set representation. This is done as follows. If the
training instance is a positive example, its S-set is assigned the
most specific elements in the language that include the instance.
When the single-representation trick holds (i.e., for each instance,
there is a concept definition whose extension is only that instance
[Dietterich et al., 1982]), the S-set contains the instance as its
sole element. When it does not hold, the learning-task-specific
method that generates instance version spaces must determine the set
of most specific concept definitions in the language that cover the
instance. The new G-set contains the single, universal concept that
says that everything is an example of the concept. If the training
instance is a negative example, its S-set is taken to be the single,
empty concept that says that nothing is an example of the concept,
and its G-set is the set of minimal specializations of the universal
concept that don’t cover the instance. This forms the boundary-set
representation of the version space of concept definitions
consistent with a single training instance.

The emulation can be summarized as follows:

1.  Form the version space of all concept definitions consistent
with an individual instance.

2.  Intersect this new version space with the version space for past
data (which starts as the full version space containing all concepts
in the concept description language) to generate a new version
space.

3.  Return to Step 1 for the next instance.

Note that this merely instantiates the general incremental
version-space merging algorithm given earlier, specifying how
individual version spaces are formed. Furthermore, in contrast to
the candidate-elimination algorithm, this emulation allows the first
instance to be negative and does not assume the
single-representation trick.

To demonstrate this incremental version-space merging implementation
of the candidate-elimination algorithm, the robot learning task
presented in Section 2 will again be used. The initial version space
has boundary sets S={∅} and G={[any-size, any-shape]} (where ∅
represents the empty concept that says nothing is an example of the
target concept). The version space for the first positive instance,
a small cube, has the boundary sets S={[small, cube]} and
G={[any-size, any-shape]} (Step 1 of the incremental version-space
merging algorithm), and when merged with the initial version space
simply returns the instance version space (Step 2). This is obtained
using the version-space merging algorithm (Section 3): in its first
step the most specific common generalizations of pairs from the two
original S-sets are formed, here the most specific common
generalizations of ∅ and [small, cube], namely {[small, cube]}; the
second step prunes away those that are not minimal and those not
covered by elements of the two original G-sets, but here nothing
need be pruned. Similarly, for the new G-set the most general common
specialization of [any-size, any-shape] and [any-size, any-shape] is
[any-size, any-shape], and nothing need be pruned.

The version space for the second, negative example, a small sphere,
is defined by S={∅} and G={[large, any-shape]; [any-size,
polyhedron]}; nothing more general excludes the negative instance.
When merged with the previous version space, the new boundary sets
are S={[small, cube]} and G={[any-size, polyhedron]}. This is
obtained by taking for the new S-set the most specific common
generalizations of [small, cube] and ∅ that are more specific than
[any-size, any-shape] and one of [large, any-shape] and [any-size,
polyhedron], i.e., covered by elements of the two original G-sets.
This simply yields {[small, cube]}. For the new G-set, the most
general common specializations of [any-size, any-shape] and [large,
any-shape], namely {[large, any-shape]}, and the most general common
specializations of [any-size, any-shape] and [any-size, polyhedron],
namely {[any-size, polyhedron]}, are taken, but [large, any-shape]
must be pruned since it is not more general than an element of one
of the original S-sets.

The third, negative example, a large octahedron, has boundary sets
S={∅} and G={[small, any-shape]; [any-size, sphere]; [any-size,
cube]; [any-size, pyramid]}. Merging this with the preceding
boundary sets yields S={[small, cube]} and G={[any-size, cube];
[small, polyhedron]}. Finally, the last, positive instance, a small
pyramid, has boundary sets S={[small, pyramid]} and G={[any-size,
any-shape]}, resulting in the final version space S=G={[small,
polyhedron]}.
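
The whole emulation on the robot task can be traced with the Python sketch below, which forms each instance version space in boundary-set form and merges it with the version space for past data. It is an illustration under assumptions rather than the paper's implementation: the hierarchy encoding, the use of `None` for the empty concept ∅, and the function names are all hypothetical, and the tree-shaped hierarchies make each pairwise generalization or specialization unique.

```python
# Assumed encoding of the generalization hierarchies (child -> parent).
PARENT = {
    "small": "any-size", "large": "any-size",
    "polyhedron": "any-shape", "sphere": "any-shape",
    "cube": "polyhedron", "pyramid": "polyhedron", "octahedron": "polyhedron",
}
TOP = ("any-size", "any-shape")  # universal concept; None is the empty concept


def ancestors(value):
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def more_general(c1, c2):
    if c2 is None:
        return True
    if c1 is None:
        return False
    return all(a in ancestors(b) for a, b in zip(c1, c2))


def mscg(c1, c2):
    """Most specific common generalization of two concepts."""
    if c1 is None or c2 is None:
        return c2 if c1 is None else c1
    return tuple(next(x for x in ancestors(a) if x in ancestors(b))
                 for a, b in zip(c1, c2))


def mgcs(c1, c2):
    """Most general common specialization, or None if there is none."""
    if c1 is None or c2 is None:
        return None
    out = []
    for a, b in zip(c1, c2):
        if a not in ancestors(b) and b not in ancestors(a):
            return None
        out.append(b if a in ancestors(b) else a)
    return tuple(out)


def merge(vs1, vs2):
    (s1, g1), (s2, g2) = vs1, vs2
    s_new = {mscg(a, b) for a in s1 for b in s2}
    g_new = {mgcs(a, b) for a in g1 for b in g2}
    s_new = {s for s in s_new if any(more_general(g, s) for g in g1)
             and any(more_general(g, s) for g in g2)}
    s_new = {s for s in s_new
             if not any(t != s and more_general(s, t) for t in s_new)}
    g_new = {g for g in g_new if any(more_general(g, s) for s in s1)
             and any(more_general(g, s) for s in s2)}
    g_new = {g for g in g_new
             if not any(h != g and more_general(h, g) for h in g_new)}
    return s_new, g_new


def instance_version_space(instance, positive):
    """Boundary sets of the concepts consistent with one instance."""
    if positive:  # the single-representation trick holds in this language
        return {instance}, {TOP}
    g_set = set()
    for i, value in enumerate(instance):
        path = ancestors(value)
        # Minimal specializations: siblings of the nodes on the instance's path.
        for child, parent in PARENT.items():
            if parent in path and child not in path:
                g_set.add(TOP[:i] + (child,) + TOP[i + 1:])
    return {None}, g_set


DATA = [(("small", "cube"), True), (("small", "sphere"), False),
        (("large", "octahedron"), False), (("small", "pyramid"), True)]

vs = ({None}, {TOP})  # initial version space: every concept in the language
trace = []
for instance, positive in DATA:
    vs = merge(vs, instance_version_space(instance, positive))
    trace.append(vs)
    print(instance, positive, vs)
```

Under these assumptions the four printed version spaces match the boundary sets given in the text, ending with S=G={[small, polyhedron]}.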

4.2  Ambiguous  Data 

When an instance is not uniquely identified, it is said to be
ambiguous. For example, only knowing a range for a person’s height
or age is a form of ambiguous data. More extreme examples are when
data are provided at too general a level (such as only knowing that
someone is tall), or when attributes are totally missing. The
incremental version-space merging emulation of the
candidate-elimination algorithm provides a mechanism for doing
concept learning even when given ambiguous data. The basic idea is
to form the version space of concept definitions for ambiguous data
by identifying the set of all concept definitions consistent with
any potential identity for the ambiguous instance: its version space
should include concept definitions consistent with any possible
interpretation of the instance. For example, if a positive instance
is known to be either black or brown, its version space should
contain all concept definitions that include the black case plus all
concept definitions that include the brown case. This version space
can be viewed as the union of two version spaces, one for black and
the other for brown.

Defining a version space requires defining the contents of its
boundary sets. For ambiguous training data this is done by setting
the boundary sets to the most specific and general concept
definitions consistent with some possibility for the training
instance. If the instance is a positive example, the S-set contains
the most specific concept definitions that include at least one
possible identity for the ambiguous instance. If the
single-representation trick holds, the S-set contains the set of all
instances that the training instance might truly be. The G-set
contains the universal concept that matches everything. If the
instance is negative the S-set contains the empty concept that
matches nothing and the G-set contains the minimal specializations
of the universal concept that do not include one of the
possibilities for the uncertain data.

The algorithm can thus be summarized as follows:

1.  (a) Form the set of all instances the given instance might be.

    (b) Form the version space of all concept definitions consistent
    with any individual instance in this set.

2.  Intersect the version space with the version space for past data
to obtain a new version space.

3.  Return to Step 1 for the next instance.

This is again just the algorithm of Section 3, with a new
specification of how individual version spaces are formed. If there
is no ambiguity, the algorithm behaves like the
candidate-elimination algorithm emulation above.

Note that ambiguous data cannot be handled through the use of
internal disjunction [Michalski, 1983]. An example of an internal
disjunction would be the concept definition [small, octahedron ∨
cube]. This says that small objects that are either octahedra or
cubes will be positive. Both small octahedra and small cubes are
included as positive by it. An ambiguous instance, on the other
hand, cannot guarantee that both will be positive; it may be that
only small cubes are positive, whereas the internal disjunction
would errantly include small octahedra. A correct solution to
handling ambiguous data must not rule out concept definitions whose
extension only includes one of the possible identities of an
ambiguous instance.

To demonstrate how ambiguous data are handled, the
learning task of Section 2 is again used. If the first, positive
instance were known to be small and either cube
or octahedron, the instance version space would have
boundary sets S={[small, cube]; [small, octahedron]}
and G={[any-size, any-shape]}. After the second instance
(a small sphere, negative example) is processed,
the resulting boundary sets are S={[small, cube]; [small,
octahedron]} and G={[any-size, polyhedron]}. The
third instance (a large octahedron, negative example) results
in S={[small, cube]; [small, octahedron]} and
G={[any-size, cube]; [small, polyhedron]}. It takes the
final instance (a small pyramid, positive example) to
make the version space converge to S=G={[small,
polyhedron]}.
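
The trace above can be checked by brute force. The following Python sketch enumerates every conjunctive concept over the two features, forms each instance version space as an explicit set, and intersects the sets. The hierarchies (small and large under any-size; cube, octahedron, and pyramid under polyhedron; polyhedron and sphere under any-shape) are assumptions inferred from the example, the empty concept is left out, and explicit enumeration stands in for the boundary-set representation that the method itself uses. All function names are illustrative.

```python
from itertools import product

# Assumed generalization hierarchies, stored as child -> parent.
PARENT = {
    "small": "any-size", "large": "any-size",
    "polyhedron": "any-shape", "sphere": "any-shape",
    "cube": "polyhedron", "octahedron": "polyhedron", "pyramid": "polyhedron",
}
SIZES = ["any-size", "small", "large"]
SHAPES = ["any-shape", "polyhedron", "sphere", "cube", "octahedron", "pyramid"]
CONCEPTS = set(product(SIZES, SHAPES))  # every [size, shape] definition


def value_covers(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False


def covers(general, specific):
    """Feature-wise 'more general than or equal to' test."""
    return all(value_covers(g, s) for g, s in zip(general, specific))


def instance_version_space(identities, positive):
    """Concepts consistent with at least one possible identity of the instance."""
    if positive:
        return {c for c in CONCEPTS if any(covers(c, i) for i in identities)}
    return {c for c in CONCEPTS if any(not covers(c, i) for i in identities)}


def boundary_sets(version_space):
    """Return (S, G): the minimal and maximal elements of an explicit set."""
    s = {c for c in version_space
         if not any(d != c and covers(c, d) for d in version_space)}
    g = {c for c in version_space
         if not any(d != c and covers(d, c) for d in version_space)}
    return s, g


def merge_all(data):
    """Intersect the instance version spaces of (identities, is_positive) pairs."""
    current = set(CONCEPTS)
    for identities, positive in data:
        current &= instance_version_space(identities, positive)
    return current


DATA = [
    ([("small", "cube"), ("small", "octahedron")], True),  # ambiguous positive
    ([("small", "sphere")], False),
    ([("large", "octahedron")], False),
    ([("small", "pyramid")], True),
]

for step in range(1, len(DATA) + 1):
    s, g = boundary_sets(merge_all(DATA[:step]))
    print(step, sorted(s), sorted(g))
```

Each printed line gives the S-set and G-set after one more instance has been merged, so the four lines can be compared directly with the four boundary-set pairs in the text.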

5  Combining  Empirical  and  Analytical 
Learning 

The previous section described how incremental version-space
merging can be used to emulate and extend the
candidate-elimination algorithm. This section describes the
use of incremental version-space merging to implement and
extend Mitchell's [1984] proposal for combining empirical
and analytical learning. The key idea is to form version
spaces consistent with the results of explanation-based
generalization (EBG) [Mitchell et al., 1986], rather than
consistent with ground data. The problem addressed is:

Given:

•  Training Data: Positive and negative examples
   of the concept to be identified. Training
   data are expressed within an instance description
   language, whose terms are assumed
   to be operational.

•  Concept Description Language: A language
   in which the final concept must be expressed.
   It is a superset of the instance description
   language, and is where generalization
   hierarchies would appear.

•  Positive-Data Domain Theory (optional): A
   set of rules and facts for proving that an
   instance is positive. Proofs terminate in
   elements of the instance description language.

•  Negative-Data Domain Theory (optional): A
   set of rules and facts for proving that an
   instance is negative. Proofs terminate in
   elements of the instance description language.

Determine:

•  A set of concept definitions in the concept
   description language consistent with the data.

The  method  processes  a  sequence  of  instances  as  follows, 
starting  with  the  first  instance: 

1.  (a)  If possible, apply EBG to the current instance to
         generate a generalized instance. Do so for all
         possible explanations. If no explanation is found,
         pass along the ground data.

    (b)  Form the version space of all concept definitions
         consistent with the (perhaps generalized) instance.
         If there are multiple explanations, include
         those concept definitions consistent with any
         single explanation.

2.  Intersect this version space with the version space
    generated from all past data.

3.  Return to the first step for the next instance.

This  is  again  an  instantiation  of  the  general  incremental 
version-space  merging  algorithm  given  earlier. 

The  basic  technique  is  to  form  the  version  space  of  con¬ 
cept  definitions  consistent  with  the  explanation-based  gen¬ 
eralization  of  each  instance  (rather  than  the  version  space 
of  concept  definitions  consistent  with  the  ground  instance). 
The  version  space  for  a  single  training  instance  reflects  the 
explanation-based  generalization  of  the  instance,  represent¬ 
ing  the  set  of  concept  definitions  consistent  with  all  in¬ 
stances  with  the  same  explanation  as  the  given  instance. 
The merging algorithm has the effect of updating the version
space with the many examples sharing the same explanation,
rather than with the single instance. In this manner
irrelevant  features  of  the  instances  are  removed,  and  learn¬ 
ing  can  converge  to  a  final  concept  definition  using  fewer 
instances. 

The  technique  also  applies  to  cases  of  multiple,  compet¬ 
ing  explanations,  when  only  one  explanation  need  be  cor¬ 
rect.  In  such  cases  the  version  space  of  concept  definitions 
consistent  with  one  or  more  of  the  potential  results  of  EBG 
is  formed.  EBG  is  applied  to  every  competing  explanation 
of  an  instance,  each  yielding  a  competing  generalization  of 
the  instance.  The  space  of  candidate  generalizations  for  the 
single  instance  contains  all  concept  definitions  consistent 
with at least one of the competing generalizations. The final
generalization after multiple instances must be consistent
with one of them. The situation is similar to that of ambiguous
data (Section 4.2), only here it is unknown which
explanation is correct. Like the earlier treatment of ambiguous
data, the version space contains all concept definitions
consistent with at least one of the possibilities.

The version space of all concept definitions consistent
with at least one explanation-based generalization of the
instance is the union of the version spaces of concept
definitions consistent with each individual explanation-based
generalization. For positive examples this union has as its
S boundary set the set of competing explanation-based
generalizations, and the G boundary set contains the universal
concept that labels everything positive. If one result of EBG
is more general than another (e.g., because the other mentions
a superset of the training-instance facts that it mentions),
only the more specific result is kept in the S-set. Over
multiple instances these version spaces consistent with the
explanation-based generalizations are incrementally
intersected to find the space of concept definitions consistent
with the analytically generalized data.

The approach is also useful when the system is provided
with a theory capable of explaining why an instance is
negative. For example, in search control a state in which an
operator should not be used is a negative instance, and a
theory for negative data would analyze why the instance is
negative: that the operator does not apply, or that it leads
to a non-optimal solution. This theory
is then used to generalize the negative instance to obtain a
generalization covering all instances that are negative for
the same reason. Incremental version-space merging then
uses this generalized instance by setting the S-set equal to
the empty concept that says nothing is an example of the
concept, and setting the G-set equal to all minimal
specializations of the universal concept that do not cover the
generalized negative instance. If there are multiple, competing
explanations, the G-set contains all minimal specializations
that do not cover at least one of the potential generalizations
obtainable by EBG using one of the explanations.

Note  that  it  is  not  necessary  to  have  a  complete  theory  ca¬ 
pable  of  explaining  (and  generalizing)  all  correct  instances 
for  this  technique  to  work.  The  version  space  of  all  concept 
definitions  consistent  with  a  plain  non-generalized  instance 
— whether  negative  or  positive — can  always  be  formed.  In¬ 
stead  of  using  EBG,  the  version  space  consists  of  all  concept 
definitions  consistent  with  the  instance,  rather  than  its  ex¬ 
planation-based  generalization.  If  a  theory  for  only  explain¬ 
ing  positive  instances  exists,  negative  instances  can  be  pro¬ 
cessed  without  using  EBG.  If  an  incomplete  theory  exists 
(i.e.,  it  only  explains  a  subset  of  potential  instances),  when 
an  explanation  exists  the  version  space  for  the  explanation- 
based  generalization  of  the  instance  can  be  used,  other¬ 
wise  the  pure  instance  version  space  should  be  used.  When 
there  is  no  domain  theory  the  learner  degenerates  to  be¬ 
have  like  the  candidate-elimination  algorithm.  The  net  re¬ 
sult  is  a  learning  method  capable  of  exhibiting  behavior  at 
various  points  along  the  spectrum  from  knowledge-free  to 
knowledge-rich  learning. 

To illustrate how incremental version-space merging
combines  empirical  and  analytical  learning,  two  examples 
are  presented.  The  first  demonstrates  how  empirical  learn¬ 
ing  generalizes  beyond  the  specific  results  obtainable  with 
EBG  alone.  The  second  demonstrates  how  empirical  learn¬ 
ing  deals  with  multiple  explanations. 


5.1  Cup  Example 

The  first  example  of  the  combination  of  incremental  version- 
space  merging  and  EBG  demonstrates  how  a  definition  of 
Cup  can  be  learned  given  incomplete  knowledge  about  cups 
plus examples of cups. It is based on the examples given by
Mitchell et al. [1986] and Flann and Dietterich [1990].

The  following  is  the  domain  theory  used  (written  in  Pro¬ 
log  notation): 

cup(X) :- holds_liquid(X),
          can_drink_from(X),
          stable(X).
holds_liquid(X) :- pyrex(X).
holds_liquid(X) :- china(X).
holds_liquid(X) :- aluminum(X).
can_drink_from(X) :- liftable(X),
                     open_top(X).
liftable(X) :- small(X).
stable(X) :- flat_bottom(X).

It  can  recognize  and  explain  some,  but  not  all,  cups.  The 
concept  description  language  used  for  this  problem  by  em¬ 
pirical  learning  utilizes  generalization  hierarchies,  includ¬ 
ing  the  knowledge  that  pyrex,  china,  and  aluminum  are  non- 
porous  materials,  and  that  black  and  brown  are  dark  colors. 
Note  that  this  information  is  not  present  in  the  domain  the¬ 
ory,  but  is  known  to  be  true  in  general.  Empirical  learning 
has  many  such  possible  generalizations.  The  goal  for  learn¬ 
ing  is  to  determine  which  potential  generalizations,  such  as 
those  mentioning  nonporous  material,  are  relevant. 
Learning begins with the first, positive example:

china(cup1).
small(cup1).
open_top(cup1).
flat_bottom(cup1).
black(cup1).

EBG  results  in  the  rule 

cup(X) :- china(X),
          small(X),
          open_top(X),
          flat_bottom(X).

written “[china, small, open, flat, anycolor]” for short.
This forms the S-set for the version space of the first instance
(and the first step of incremental version-space merging),
and its G-set contains the universal concept [anymaterial,
anysize, anytop, anybottom, anycolor]. The second
step of incremental version-space merging intersects
this with the initial full version space, which gives back this
first-instance version space.

Incremental  version-space  merging  then  returns  to  its  first 
step  for  the  next,  positive  instance,  which  is: 

pyrex(cup2).
small(cup2).
open_top(cup2).
flat_bottom(cup2).
brown(cup2).

EBG  results  in  the  rule 

cup(X) :- pyrex(X),
          small(X),
          open_top(X),
          flat_bottom(X).

The S-set for this instance's version space contains the result
of EBG, namely [pyrex, small, open, flat, anycolor],
and its G-set contains the universal concept. Merging this
with the version space for the first iteration yields a version
space whose S-set contains [nonporous, small, open, flat,
anycolor] and whose G-set contains the universal concept.

The  final  instance  is  a  negative  example: 

aluminum(can1).
small(can1).
closed_top(can1).
flat_bottom(can1).
white(can1).

For  this  example  the  theory  of  negative  data  is  assumed  to 
include  the  following  rules  (among  others): 

not_a_cup(X) :- cannot_drink_from(X).
cannot_drink_from(X) :- closed_top(X).

EBG  yields  the  following  rule: 

not_a_cup(X) :- closed_top(X).

This is then used to determine the most general concept
definitions that exclude this generalized case of
not_a_cup, namely {[anymaterial, anysize, open,
anybottom, anycolor]}, which forms the G-set of the version
space for this third instance. The S-set contains the
empty concept.

When this third instance version space is merged with the
result of the previous two iterations of incremental
version-space merging, the resulting S-set contains [nonporous,
small, open, flat, anycolor] and the resulting G-set contains
[anymaterial, anysize, open, anybottom, anycolor].
Note that the domain theories have done part of
the work, with the color attribute being ignored and only
the third attribute being deemed relevant for the negative
instance, while empirical learning determined nonporous.
Further data would continue refining the version space.
However, it is already known that whatever the final rule,
it will include small, nonporous, open-topped objects, like
Styrofoam cups, which the original theory did not recognize
as cups.

This simple domain also demonstrates the point made
earlier about the technique degenerating to look like pure
empirical learning. Consider the same examples, only without
the domain theory present. At each iteration the resulting
version space would be exactly the same as would
be created by the candidate-elimination algorithm. The final
version space would have an S-set containing [nonporous,
small, open, flat, darkcolor], and a G-set containing
two elements: [anymaterial, anysize, open, anybottom,
anycolor] and [anymaterial, anysize, anytop,
anybottom, darkcolor]. This version space contains more
elements than the corresponding version space using EBG.
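
The comparison between learning with and without the domain theories can be replayed with a small brute-force Python sketch. It does not implement EBG: the generalized instances derived above are supplied by hand, and the ground instances are used for the no-theory run. The attribute hierarchies are assumptions limited to the values that appear in this example, a generalized negative is treated as excluding every concept that shares an instance with it, and explicit enumeration replaces boundary-set manipulation. All names are illustrative.

```python
from itertools import product

# Assumed hierarchies (child -> parent) for the five cup attributes.
PARENT = {
    "nonporous": "anymaterial",
    "pyrex": "nonporous", "china": "nonporous", "aluminum": "nonporous",
    "small": "anysize", "large": "anysize",
    "open": "anytop", "closed": "anytop",
    "flat": "anybottom", "round": "anybottom",
    "darkcolor": "anycolor", "white": "anycolor",
    "black": "darkcolor", "brown": "darkcolor",
}
ROOTS = ["anymaterial", "anysize", "anytop", "anybottom", "anycolor"]


def descendants(root):
    """All values at or below `root` in its hierarchy."""
    found = [root]
    for value in found:  # the list grows while it is being scanned
        found.extend(c for c, parent in PARENT.items() if parent == value)
    return found


CONCEPTS = set(product(*(descendants(r) for r in ROOTS)))


def value_covers(general, specific):
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False


def covers(general, specific):
    return all(value_covers(g, s) for g, s in zip(general, specific))


def disjoint(a, b):
    """No instance matches both: some feature has incomparable values."""
    return any(not value_covers(x, y) and not value_covers(y, x)
               for x, y in zip(a, b))


def learn(data):
    """Intersect one version space per (description, is_positive) pair."""
    space = set(CONCEPTS)
    for description, positive in data:
        if positive:
            space = {c for c in space if covers(c, description)}
        else:
            space = {c for c in space if disjoint(c, description)}
    return space


def boundary_sets(space):
    s = {c for c in space if not any(d != c and covers(c, d) for d in space)}
    g = {c for c in space if not any(d != c and covers(d, c) for d in space)}
    return s, g


WITH_EBG = [  # generalized instances taken from the text
    (("china", "small", "open", "flat", "anycolor"), True),
    (("pyrex", "small", "open", "flat", "anycolor"), True),
    (("anymaterial", "anysize", "closed", "anybottom", "anycolor"), False),
]
GROUND_ONLY = [  # the same three instances without any domain theory
    (("china", "small", "open", "flat", "black"), True),
    (("pyrex", "small", "open", "flat", "brown"), True),
    (("aluminum", "small", "closed", "flat", "white"), False),
]

for name, data in (("with EBG", WITH_EBG), ("ground only", GROUND_ONLY)):
    s, g = boundary_sets(learn(data))
    print(name, sorted(s), sorted(g))
```

Under these assumed hierarchies the ground-only run retains darkcolor in its S-set and gains a second G-set element, which is the difference described in the paragraph above.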

5.2  Can_put_on_table  Example

As  an  example  of  using  domain  theories  with  multiple  com¬ 
peting  explanations  consider  the  following  domain  theory 
for  can_put_on_table: 

can_put_on_table(X) :- stable(X),
                       small(X).
can_put_on_table(X) :- stable(X),
                       light(X).
stable(X) :- flat_bottom(X).

It  provides  two  potential  explanations  for  when  an  object 
can  successfully  be  placed  on  a  table:  the  object  must  be 
stable, and either small or light. Given an unclassified
instance, the theory cannot be used to predict its
classification (whether it is positive or negative), since the theory
can explain too many things, and will classify some potentially
negative examples as positive. However, it is possible
to  explain  a  positive  instance  once  its  classification  is  given. 
The  goal  for  learning  is  to  determine  a  definition  consistent 
with  the  data  plus  the  subset  of  the  theory  that  actually  mod¬ 
els  the  observed  data.  Furthermore,  when  there  are  multiple 
competing  explanations  of  a  given  positive  instance,  later 
instances  should  allow  determining  which  of  the  compet¬ 
ing  explanations  is  consistent  across  all  data. 

For  example,  given  a  can  as  a  positive  example  of  an  ob¬ 
ject  that  can  be  placed  on  a  table: 

flat_bottom(can1).
small(can1).
light(can1).

the  first  step  of  the  incremental  version-space  merging  pro¬ 
cess  uses  EBG  to  form  two  rules,  each  corresponding  to  a 
different  explanation: 

can_put_on_table(X) :-
    flat_bottom(X),
    small(X).

can_put_on_table(X) :-
    flat_bottom(X),
    light(X).

These will be abbreviated to “[flat, small, anyweight]” and
“[flat, anysize, light]”. The resulting instance version space
is bounded by an S-set containing these two concept definitions
and a G-set containing the universal concept that says
everything is an example of the concept. Intersecting this
version space with the initial version space that contains
all concept definitions in the concept description language
yields the instance version space in return.

Returning  to  the  first  step  of  the  learning  process  for  the 
following  positive,  second  instance, 

flat_bottom(cardboard_box1).
big(cardboard_box1).
light(cardboard_box1).

EBG can only generate one rule:

can_put_on_table(X) :-
    flat_bottom(X),
    light(X).

This results in an instance version space containing [flat,
anysize, light] in the S-set and the G-set containing the
universal concept. Merging the two instance version spaces
(step two of incremental version-space merging) results in
an S-set with the single element [flat, anysize, light] and
the G-set containing the universal concept.

As  an  example  of  dealing  with  negative  data  with  no 
negative-instance  domain  theory,  consider  the  following 
negative  example  of  can_put_on_table: 

round_bottom(bowling_ball1).
small(bowling_ball1).
heavy(bowling_ball1).

The version space of concept definitions that do not include
it has an S-set that contains the empty concept and
a G-set that contains three concept definitions: {[flat,
anysize, anyweight], [anybottom, large, anyweight],
[anybottom, anysize, light]}. When merged with the version
space for past data, incremental version-space merging
yields a version space whose S-set contains [flat, anysize,
light] and whose G-set contains the two concept definitions
[flat, anysize, anyweight] and [anybottom, anysize,
light]. Subsequent data would further refine this version
space.

6  Computational  Complexity 

Previous sections have described incremental version-space
merging and two of its applications. This section analyzes
the computational complexity of incremental version-space
merging. Incremental version-space merging has
two major steps: version-space formation, and version-space
intersection. The computational complexity of each
of these is first discussed, followed by an analysis of the
complexity of the overall incremental version-space merging
method. Further details are provided elsewhere [Hirsh,
1989b].

6.1  Version-Space  Formation 

The  first  step  of  each  iteration  of  incremental  version-space 
merging  is  to  form  the  version  space  of  concept  definitions 
to  consider  given  the  current  piece  of  information.  Since 
this  occurs  at  each  step,  it  is  clearly  necessary  for  it  to  be  fea¬ 
sible  computationally.  A  general  analysis  of  the  complexity 
of  this  step  is  impossible — it  depends  both  on  the  specific 
method  for  generating  version  spaces  and  on  the  particu¬ 
lar  concept  description  language  being  used.  However,  it 
is  possible  to  do  this  analysis  for  specific  approaches,  and 
that  is  what  will  be  done  here.  The  particular  method  con¬ 
sidered  here  works  for  conjunctive  languages  over  a  fixed 
set  of  k  features,  and  handles  the  case  when  each  version 
space  is  effectively  the  union  of  n  version  spaces  (as  occurs 
with  ambiguous  data,  data  with  bounded  inconsistency,  and 
EBG  with  multiple  explanations).  The  analysis  applies  to 
the  applications  given  here,  as  well  as  for  learning  from  data 
with  bounded  inconsistency  [Hirsh,  1990b]. 

For positive instances, unpruned S-sets have $n$ elements.
Pruning nonminimal elements therefore requires at most
$n^2$ comparisons of relative generality. Each comparison
of concept definitions requires $k$ feature comparisons. For
tree-structured features, each feature comparison takes time
proportional to the height $h$ of the tree-structured hierarchy
for that feature. However, $h$ is at most $\log v$, where
$v$ is the maximum number of values any feature can take
on.^1 Therefore comparing two concept descriptions
takes time proportional to at most $k \log v$. Thus computing
an S-set for tree-structured features takes at most
time proportional to $n^2 k \log v$. In contrast, when features
are ranges of the form $a < x < b$, feature comparisons take
constant time, and thus computing the S-set takes time
proportional to at most $n^2 k$. Either way, as long as forming the
set of possible identities is tractable, handling positive data
is tractable.


^1 Of course, this is only a useful bound on h when v is finite.


For negative instances there are $nkb$ elements in the unpruned
G-set, where $b$ is the number of ways on average to
specialize a feature to exclude a value.^2 Pruning nonmaximal
elements therefore requires $(nkb)^2$ concept comparisons.
For tree-structured features $b$ is at most $(w-1)\log v$,
where $w$ is the maximum branching factor for all feature
hierarchies, and thus computing the G-set takes time proportional
to $(nk(w-1)\log v)^2 \, k \log v = (n(w-1))^2 (k \log v)^3$. In
the case where features are ranges of the form $a < x < b$,
with the range of $x$ discretized to a fixed set of values
(such as measuring values to the nearest millimeter [Hirsh,
1990b]), $b \le 2$, and thus the time to compute a G-set is
proportional to at most $n^2 k^3$. In both cases, this is again
feasible as long as determining the $n$ possibilities is tractable.

6.2  Version-Space  Merging

The second step of incremental version-space merging is to
intersect two version spaces using the version-space merging
algorithm. The complexity of version-space merging is
again dependent on the particular concept description language,
and again the analysis is done for conjunctive languages
over $k$ features. The algorithm computes for the
new S-set the most specific common generalization of pairs
from the two S-sets that are covered by some element of
each of the G-sets but not by some other element of the
new S-set, and does a symmetrical process for the G-sets.
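
The S-set half of this step can be sketched in Python for tree-structured features. This is a minimal illustration, not the paper's implementation: it assumes simple size, bottom, and weight hierarchies for the can_put_on_table example of Section 5.2, it omits the symmetric G-set computation, and the function names are illustrative.

```python
# Assumed hierarchies for the can_put_on_table example (child -> parent).
PARENT = {
    "flat": "anybottom", "round": "anybottom",
    "small": "anysize", "large": "anysize",
    "light": "anyweight", "heavy": "anyweight",
}


def ancestors(value):
    """Return [value, parent, grandparent, ...] up to the root."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def covers(general, specific):
    """Feature-wise 'more general than or equal to' test."""
    return all(g in ancestors(s) for g, s in zip(general, specific))


def common_generalization(a, b):
    """Most specific common generalization; unique for tree-structured features."""
    result = []
    for x, y in zip(a, b):
        y_chain = ancestors(y)
        # First ancestor of x (starting with x itself) that also covers y.
        result.append(next(v for v in ancestors(x) if v in y_chain))
    return tuple(result)


def merge_s_sets(s1, s2, g1, g2):
    """New S-set for the intersection of two version spaces."""
    candidates = {common_generalization(a, b) for a in s1 for b in s2}
    # Keep only candidates covered by some element of each G-set.
    candidates = {c for c in candidates
                  if any(covers(g, c) for g in g1)
                  and any(covers(g, c) for g in g2)}
    # Drop candidates that are more general than another candidate.
    return {c for c in candidates
            if not any(d != c and covers(c, d) for d in candidates)}


UNIVERSAL = ("anybottom", "anysize", "anyweight")
s_first = {("flat", "small", "anyweight"), ("flat", "anysize", "light")}
s_second = {("flat", "anysize", "light")}
print(merge_s_sets(s_first, s_second, {UNIVERSAL}, {UNIVERSAL}))
```

With the two S-sets from the first two can_put_on_table instances, the pairwise generalizations are [flat, anysize, anyweight] and [flat, anysize, light]; pruning removes the former, leaving the single-element S-set reported in Section 5.2.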

For both tree-structured features and features of the form
$a < x < b$ the minimal common generalization of two
concept descriptions is unique. Therefore, if there are $m_1$
and $m_2$ elements in the two initial S-sets, there are at most
$m_1 m_2$ in the resulting unpruned S-set. When features
are tree-structured, the process that computes the minimal
generalization of two concept definitions takes time
proportional to $k \log^2 v$. Computing the unpruned S-set thus
takes time proportional to $m_1 m_2 k \log^2 v$. Computing
minimal elements takes an additional $(m_1 m_2)^2 k \log v$, and
removing elements not covered by some G-set element takes
$m_1 m_2 (n_1 + n_2) k \log v$, where $n_1$ and $n_2$ are the two G-set
sizes. Thus the overall complexity is proportional to at most

$m_1 m_2 k \log^2 v + (m_1 m_2)^2 k \log v + m_1 m_2 (n_1 + n_2) k \log v$.

The G-set case is nearly symmetric, with the exception that
computing the maximal specialization of two concept
definitions takes time proportional to at most $k \log v$, so the
overall complexity is proportional to at most

$n_1 n_2 k \log v + (n_1 n_2)^2 k \log v + n_1 n_2 (m_1 + m_2) k \log v$.

Whichever is greater will be the overriding term.

When features are of the form $a < x < b$, computing
the minimal generalization of two concept definitions takes
time proportional to $k$. Computing the new S-set therefore
takes time proportional to at most
$m_1 m_2 k + (m_1 m_2)^2 k + m_1 m_2 (n_1 + n_2) k$, and for the
G-set it is $n_1 n_2 k + (n_1 n_2)^2 k + n_1 n_2 (m_1 + m_2) k$.
Again, whichever is greater is the overriding term.

^2 If b is infinite (e.g., when the number of values v a feature
may take is infinite), the use of version spaces is inappropriate, since
boundary-set sizes must be finite.

6.3  Incremental  Version-Space  Merging

The preceding two subsections discussed the complexity of
the individual steps taken during each iteration of incremental
version-space merging. The complexity of the overall
algorithm given a set of instances is a more difficult issue.
It depends on the nature of the concept being learned, the
concept description language, and the particular instances
provided as data. In the worst case the process will be
exponential in the number of instances, as has been pointed
out by Haussler [1988] for the candidate-elimination
algorithm, which is subsumed by this work.

However,  in  some  cases  where  exponential  growth  could 
occur,  it  is  possible  to  order  the  data  so  that  exponential 
growth  is  avoided.  One  example  where  this  is  true  (for  con¬ 
sistent  data)  is  in  cases  where  after  processing  all  the  given 
data  the  resulting  boundary  sets  are  singleton.  In  such  cases 
the  data  can  always  be  ordered  (in  polynomial  time)  to  guar¬ 
antee  that  the  boundary  sets  will  remain  singleton  through¬ 
out  learning.  (The  ordering  algorithm  basically  processes 
positive data first, then processes negative “near-misses”
[Winston,  1975]  that  remove  selected  “don’t  cares.”)  This 
is  even  true  if  Haussler ’s  data  set  is  a  subset  of  the  full  data 
set.  This  is  an  area  of  current  work. 

Even when boundary sets do not grow exponentially in
size there are techniques for further improving the performance
of incremental version-space merging. Two have
been explored: skipping data that do not change the version
space, and selecting data that decrease boundary-set size.
The key idea in the first case is to note that instances that
are classified the same by all members of the version space
will not change the version space (assuming that the version
space will not collapse to the empty set). This allows
two improvements to incremental version-space merging:
whenever one instance version space is a superset of another,
the first instance can be removed; and whenever the
version space for the current instance is a superset of the
current version space, it can be skipped.
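
The second of these improvements can be illustrated with explicit sets. The sketch below is only an assumption-laden stand-in: it represents each version space as a Python set of hypothetical concept names (a, b, c, d), whereas the method itself works on boundary sets, and it counts how many intersections are actually performed.

```python
def merge_with_skipping(instance_spaces):
    """Intersect instance version spaces, skipping any that cannot shrink the result.

    Returns the final version space and the number of intersections performed.
    """
    current = None
    merges = 0
    for space in instance_spaces:
        if current is None:
            current = set(space)   # first instance: nothing to intersect yet
            continue
        if space >= current:       # superset of the current version space
            continue               # intersection would leave it unchanged
        current &= space
        merges += 1
    return current, merges


# Hypothetical version spaces over four concept names.
spaces = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"a", "b"}]
print(merge_with_skipping(spaces))
```

Here the second instance version space contains the current one, so it is skipped and only one intersection is performed; the result is the same as intersecting all three sets.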

The second technique is based on the observation that,
since the complexity of version-space merging is a function
of boundary-set size, obtaining small boundary sets is
a good idea. Furthermore, practice shows that once the
boundary sets reach small size they typically stay small.
Two specific heuristics were used that were generally
successful in this task: selecting instances with small boundary
sets, and selecting instances that have boundary sets with
some overlap with the current version-space boundary sets.

7  Summary 

This  paper  has  presented  a  general  framework  for  concept 
learning  based  on  a  generalization  of  Mitchell’s  version- 
space approach that removes its assumption of strict consistency
with data. The observation on which incremental
version-space  merging  is  based  is  that  concept  learning  can 
be  viewed  as  the  two-step  process  of  specifying  sets  of  rele¬ 
vant  concepts  and  intersecting  these  sets.  This  observation 
is ultimately the major contribution of this work.

Abstract

We present an approach to modeling the
average case behavior of learning
algorithms. Our motivation is to predict the
expected accuracy of learning algorithms as
a function of the number of training
examples. We apply this framework to a
purely empirical learning algorithm (the
one-sided algorithm for pure conjunctive
concepts) and to an algorithm that combines
empirical and explanation-based learning.

We  evaluate  the  average-case  models  by 
comparing  the  accuracy  predicted  by  the 
models  to  the  actual  accuracy  obtained  by 
running  the  learning  algorithms. 

1  Introduction 

Most research in machine learning adheres to either
a theoretical or an experimental methodology
(Langley, 1989). Some attempt to understand
learning algorithms by testing the algorithms on a
variety of problems (e.g., Fisher, 1987; Minton,
1987). Others perform formal mathematical analysis
of algorithms to prove that a given class of concepts
is learnable from a given number of training
examples (e.g., Valiant, 1984; Haussler, 1987). The
common goal of this research is to gain an
understanding of the capabilities of learning
algorithms. However, in practice, the conclusions of
these two approaches are quite different.
Experiments lead to findings on the average case
accuracy of an algorithm. Formal analyses
typically deal with distribution-free, worst-case
bounds. The number of examples required to
guarantee learning a concept in the worst case does
not accurately reflect the number of examples
required to learn an accurate concept in practice.

We have begun construction of an average case
learning model to unify the formal mathematical and
the empirical approaches to understanding the
behavior of machine learning algorithms. Current
experience with machine learning algorithms has
led to a number of empirical observations about the
behavior of various algorithms. An average case
model can explain these observations, make
predictions, and guide the development of new
learning algorithms.

1.1  PAC  learning 

Valiant  (1984)  has  proposed  a  model  to 
probabilistically  justify  the  inductive  leaps  made  by 
an  empirical  learning  program.  The  probably 
approximately  correct  (PAC)  model  indicates  that  a 
system  has  learned  a  concept  if  the  system  can 
^arantee  with  high  probability  that  its  hypothesis  is 
approximately  correct.  Approximately  correct 
means  that  the  concept  will  have  an  error  no  greater 
than  e  (i.e.,  the  ratio  of  misclassified  examples  to 
total  examples  is  less  than  e).  The  learning  system 
is  required  to  produce  an  approximately  correct 
concept  with  probability  1-5.  For  a  given  class  of 
concepts,  the  PAC  model  can  be  used  to  determine 
an  upper  bound  on  the  number  of  training  examples 
required  to  achieve  an  accuracy  of  1-e  with 
probability  1-6.  The  PAC  model  has  led  to  many 
important  insists  about  the  capabilities  of  machine 
learning  algorithms.  However,  there  is  currently  a 
wide  gap  between  the  theoretical  results  and  the 
practical  results  of  running  learning  algorithms  on 
test  data.  In  particular,  Buntine  (1989)  has  argued 
that  the  Valiant  model  can  produce  overly- 
conservative  estimates  of  error  and  does  not  take 
advantage  of  information  available  in  actual  training 
sets. 
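As a rough illustration of how conservative a distribution-free bound can be, the sketch below evaluates a standard textbook sample-size bound for a consistent learner over a finite hypothesis space, m >= (1/ε)(ln|H| + ln(1/δ)). The bound itself, the function names, and the choice |H| = 3^n + 1 for conjunctions over n binary features are assumptions made for illustration; none of them is taken from this paper.

```python
import math


def conjunction_space_size(n_features: int) -> int:
    # Assumption: each binary feature appears positively, negated, or not
    # at all in a conjunction, plus one hypothesis that is always false.
    return 3 ** n_features + 1


def pac_sample_bound(hypothesis_space_size: int, epsilon: float, delta: float) -> int:
    """Examples sufficient for error <= epsilon with probability >= 1 - delta
    (textbook bound for a consistent learner, finite hypothesis space)."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    needed = (math.log(hypothesis_space_size) + math.log(1 / delta)) / epsilon
    return math.ceil(needed)


# Ten binary features, 90% accuracy, 95% confidence.
bound = pac_sample_bound(conjunction_space_size(10), epsilon=0.1, delta=0.05)
print(bound)
```

The bound depends only on the size of the hypothesis space, ε and δ; it uses nothing about how the training examples are actually distributed, which is the kind of information an average case model can exploit.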

1.2  FAC  learning 

Recently,  Dietterich  (1989)  has  proposed  the 
frequently  approximately  correct  (FAC)  learning 
model.  This  learning  model  addresses  the  question 
of  how  frequently  a  learning  algorithm  acquires  a 
hypothesis  that  is  approximately  correct  on  a 
training set of a given size. The frequency of
correctness is with respect to all possible training
sets of the given size.

Dietterich has run three common learning
programs on all 256 possible concepts of three
binary features and found that the best algorithm (the
one-sided conjunctive learning algorithm (Haussler,
1987)) can frequently (i.e., for 90% of the sets of
training examples consisting of four of the eight
possible examples) approximately (with accuracy
greater than or equal to 87.5%) learn ten of these
concepts. Dietterich has also calculated an upper
bound on the number of concepts that are
FAC-learnable. For the parameters used in the
experiments, at most 88 concepts are FAC-learnable.
This  implies  that  either  the  upper  bound  is  too  high, 
or  the  current  generation  of  learning  programs  can 
be  improved  considerably. 
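The FAC criterion can be made concrete by enumeration. The sketch below is a minimal illustration, not a reproduction of Dietterich's programs: it assumes a simple most-specific-conjunction learner over literals, measures accuracy over all eight possible examples, and uses hypothetical function names. Because the learner is an assumption, the set of concepts it reports as FAC-learnable need not match the figures quoted above.

```python
from itertools import combinations, product

INSTANCES = list(product((0, 1), repeat=3))       # the 8 possible examples
TRAINING_SETS = list(combinations(range(8), 4))   # every set of 4 examples


def learn_conjunction(positives):
    """Most specific conjunction of literals covering all positive examples.
    Returns None (classify everything negative) when there are no positives."""
    if not positives:
        return None
    # A literal (k, v) requires feature k to have value v.
    return {(k, v) for k in range(3) for v in (0, 1)
            if all(x[k] == v for x in positives)}


def predict(hypothesis, x):
    return hypothesis is not None and all(x[k] == v for k, v in hypothesis)


def is_fac_learnable(concept, frequency=0.9, accuracy=0.875):
    """concept is an 8-bit mask; bit n set means INSTANCES[n] is positive."""
    labels = [bool(concept >> n & 1) for n in range(8)]
    good = 0
    for train in TRAINING_SETS:
        h = learn_conjunction([INSTANCES[n] for n in train if labels[n]])
        correct = sum(predict(h, INSTANCES[n]) == labels[n] for n in range(8))
        good += correct / 8 >= accuracy
    return good / len(TRAINING_SETS) >= frequency


fac_concepts = [c for c in range(256) if is_fac_learnable(c)]
print(len(TRAINING_SETS), len(fac_concepts))
```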

1.3 Mathematical models of human learning

Several  psychologists  have  created  mathematical 
models  of  strategies  proposed  as  models  of  human 
or  animal  learning  (e.g.,  Atkinson,  Bower  & 
Crothers,  1965;  Restle,  1958).  These  models  have 
focused  on  average  case  behavior  of  learning 
algorithms.  There  are  several  reasons  that  these 
models  cannot  be  directly  applied  to  machine 
learning  algorithms.  First,  these  models  typically 
study  learning  algorithms  restricted  to  less  complex 
concepts  (e.g.,  single  attribute  discriminations)  than 
those  typically  used  in  machine  learning.  Second, 
these  models  also  address  complications  not  present 
in  machine  learning  programs  since  they  must 
account  for  individual  differences  in  human  learners 
caused  by  such  factors  as  attention,  motivation,  and 
memory  limitations.  Finally,  the  performance 
metric  predicted  by  these  models  is  generally  the 
expected  number  of  trials  required  to  learn  a  concept 
with perfect accuracy. Typically, experimental
studies of machine learning algorithms report the
observed accuracy after a given number
of examples.

2  The  Average  Case  Learning  Model 

We  have  been  developing  a  framework  for  average 
case  analysis  of  machine  learning  algorithms.  The 
framework  for  analyzing  the  expected  accuracy  of 
the  hypothesis  produced  by  a  learning  algorithm 
consists  of  determining: 

• The conditions under which the algorithm
changes the hypothesis for a concept.

• How often these conditions occur.

•  How  changing  a  hypothesis  affects  the  accuracy 
of  a  hypothesis. 


Clearly, the second requirement presupposes
information about the distribution of the training
examples. Therefore, unlike the PAC model, the
framework we have developed is not distribution-free.
Furthermore, to simplify computations (or
reduce the amount of information required by the
model) we will make certain independence
assumptions (e.g., the probabilities of all irrelevant
features occurring in a training example are
independent). Similarly, the third requirement
presupposes  some  information  about  the  test 
examples.  We  will  make  the  same  simplifying 
assumptions  about  the  test  examples  as  the  training 
examples. 

Determining  conditions  under  which  a  learning 
algorithm  changes  a  hypothesis  requires  an  analysis 
of  the  operators  used  for  creating  and  changing 
hypotheses. In this respect, the framework is more
similar to the mathematical models of human and
animal learning strategies than to the theoretical results
on machine learning algorithms, which typically are
concerned with the relationship between the size of
the hypothesis space and the number of training
examples. We will restrict our attention to learning
algorithms that maintain a single hypothesis and
incrementally modify the hypothesis when the
hypothesis incorrectly classifies examples.

2.1  An  average  case  model  of  wholist 

We will first show how the framework can be
applied to the wholist algorithm (Bruner, Goodnow,
& Austin, 1956), a predecessor of the one-sided
algorithm  for  pure  conjunctive  concepts  (Haussler, 
1987).  Although  this  is  a  relatively  simple 
algorithm,  to  our  knowledge  this  is  the  first  time  an 
average  case  analysis  of  a  machine  learning 
algorithm  has  been  shown  to  predict  the  expected 
accuracy  in  sufficient  detail  that  it  can  be  compared 
to  observed  accuracy  obtained  by  running  the 
algorithm.  The  goal  of  this  analysis  is  to  predict  the 
probability  that  a  randomly  drawn  example  will  be 
classified correctly by an algorithm.¹ The wholist
algorithm  is  shown  in  Table  1.  This  algorithm 
incrementally  processes  training  examples.  The 
hypothesis  maintained  is  simply  the  conjunction  of 
all  features  that  have  appeared  in  all  positive  training 
examples  encountered.  The  analysis  will  assume 
that the concept can in fact be represented as a
conjunction  of  features.  We  will  use  the  following 
notation  in  describing  the  algorithm: 


1. Flann & Dietterich (1989) present an analysis
of a component of IOE that is similar to wholist.
However, their analysis calculates the number of training
examples required to learn a hypothesis to a given
accuracy.

Table 1. The wholist algorithm

1.  Initialize  the  hypothesis  to  the  conjunction  of  all  features  that 
describe  training  examples. 

2. If the new example is a positive example, and the hypothesis
misclassifies the new example, then remove all features from the
hypothesis that are not present in the example.

3.  Otherwise,  do  nothing. 


• f_j is the j-th irrelevant feature of a training
example. A feature is irrelevant if the true
conjunctive definition of the concept does not
include the feature.

• N is the number of examples (both positive and
negative) seen so far.
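The three steps of Table 1 can be sketched in a few lines. This is a minimal illustration under the assumption that each example is given as the set of features present in it together with a positive/negative label; the function and variable names are hypothetical.

```python
def wholist(examples, all_features):
    """Incrementally learn a pure conjunction as described in Table 1.

    examples: iterable of (present, is_positive) pairs, where `present`
    is the set of features that appear in the example.
    """
    # Step 1: start with the conjunction of all features.
    hypothesis = set(all_features)
    for present, is_positive in examples:
        # The hypothesis classifies an example as positive only when every
        # feature in the hypothesis is present in the example.
        misclassified_positive = is_positive and not hypothesis <= set(present)
        if misclassified_positive:
            # Step 2: drop features that are absent from the example.
            hypothesis &= set(present)
        # Step 3: otherwise, do nothing.
    return hypothesis


training = [
    ({"f1", "f2", "f3"}, True),
    ({"f4"}, False),            # negative examples never change the hypothesis
    ({"f1", "f2", "f4"}, True),
]
print(sorted(wholist(training, {"f1", "f2", "f3", "f4"})))
```

Only positive examples can remove features, so the hypothesis is always at least as specific as the true conjunction and errs only on positive examples.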

The  following  information  is  required  to  predict 
the  expected  accuracy  of  wholist.  Note  that  while 
this  is  much  more  information  than  required  by  the 
PAC  model,  this  is  exactly  the  information  required 
to generate training examples to test the algorithm:

•  P  is  the  probability  of  drawing  a  positive 
training  example. 

•  I  is  the  number  of  irrelevant  features. 

• P(f_j) is the probability that irrelevant feature j is
present in a positive training example.

Note  that  the  accuracy  of  this  algorithm  does  not 
depend  upon  the  number  of  relevant  features  that  are 
conjoined  to  form  the  true  concept  definition. 
Therefore,  the  analysis  does  not  make  use  of  the 
total  number  of  features  or  the  number  of  relevant 
features. 

The wholist algorithm has only one operator to
revise a hypothesis. An irrelevant feature is dropped
from the hypothesis when a positive training
example does not include the irrelevant feature.
Therefore, if i positive examples have been seen out
of N total training examples, the probability that
irrelevant feature j remains in the hypothesis (i.e., j
has appeared in all i positive training examples) is

$$P(f_j)^{i}.$$

The  hypothesis  created  by  the  wholist  algorithm 
misclassifies  a  positive  test  example  if  the 
hypothesis  contains  any  irrelevant  features  that  are 
not  included  in  the  test  example.  The  wholist 
algorithm  does  not  misclassify  negative  test 
examples  (provided  that  the  true  concept  definition 
can  be  represented  as  a  conjunction  of  the  given 
features).  Therefore,  the  probability  that  feature  j 
does  not  cause  a  positive  test  example  to  be 


misclassified after i positive training examples is
given by nm_wholist(f_j, i) (where nm stands for not
misclassified):

$$nm_{wholist}(f_j, i) = 1 - P(f_j)^{i} \cdot (1 - P(f_j)) \qquad [1]$$

If all irrelevant features are independent², then
after i positive training examples, the probability
that no irrelevant feature will cause a positive test
example to be misclassified is:

$$\prod_{j=1}^{I} nm_{wholist}(f_j, i) \qquad [2]$$

Finally, in order to predict the accuracy of the
hypothesis  produced  by  the  wholist  algorithm  after 
N  training  examples,  it  is  necessary  to  take  into 
consideration  the  probability  that  exactly  i  of  the  N 
training  examples  are  positive  examples  for  each 
value  of  i  from  0  to  N.  Therefore,  the  accuracy  of 
the  wholist  algorithm  (where  accuracy  is  defined  as 
the  probability  that  a  randomly  drawn  positive 
example  will  be  classified  correctly  by  an  algorithm) 
can  be  given  by: 

$$\sum_{i=0}^{N} \left[ b(i, N, P) \cdot \prod_{j=1}^{I} nm_{wholist}(f_j, i) \right] \qquad [3]$$

where b(i, N, P) is the binomial formula:

$$b(i, N, P) = \binom{N}{i} P^{i} (1 - P)^{N-i}.$$
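Equations 1 through 3 translate directly into code. The following is a minimal sketch; the function names are illustrative, and the five feature probabilities are merely example values chosen inside the 0.05 to 0.3 range used in the simulation.

```python
from math import comb, prod


def nm_wholist(p_f: float, i: int) -> float:
    """Equation 1: probability that one irrelevant feature does not cause a
    positive test example to be misclassified after i positive examples."""
    return 1 - p_f ** i * (1 - p_f)


def wholist_accuracy(n: int, p_pos: float, p_features) -> float:
    """Equation 3: expected accuracy on positive test examples after n
    training examples, assuming independent irrelevant features."""
    return sum(
        comb(n, i) * p_pos ** i * (1 - p_pos) ** (n - i)   # b(i, N, P)
        * prod(nm_wholist(p, i) for p in p_features)        # Equation 2
        for i in range(n + 1)
    )


# Example values (assumed): P = 0.4 and five irrelevant features.
IRRELEVANT = [0.05, 0.10, 0.20, 0.25, 0.30]
curve = [(n, wholist_accuracy(n, 0.4, IRRELEVANT)) for n in range(0, 41, 2)]
for n, acc in curve[:5]:
    print(n, round(acc, 4))
```

With no training examples every irrelevant feature is still in the hypothesis, so the predicted accuracy reduces to the product of the P(f_j); the prediction then rises toward 1 as N grows.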

In  Figure  1,  the  predicted  and  the  actual 
accuracies  of  the  hypothesis  produced  by  the  wholist 
algorithm  are  plotted.  The  mean  accuracy  of  one 
hundred  runs  of  the  wholist  algorithm  is  plotted  as  a 
function  of  the  number  of  training  examples.  After 
every two training examples, the accuracy of the current
hypothesis  was  measured  by  classifying  one  hundred 
positive  test  examples.  The  concept  to  be  learned 
was  constructed  from  a  set  of  ten  features.  Five  of 
these  features  are  irrelevant.  The  probability  that  a 
given  irrelevant  feature  was  present  in  a  positive 
training  example  ranged  from  5%  to  30%  (i.e.,  0.05 

2.  If  irrelevant  features  are  not  independent,  then 
the  average  case  learning  model  would  also  require 
conditional probabilities for the irrelevant features.

Figure 1: A comparison of the expected and actual accuracy of the wholist algorithm.
The x-axis is the number of training instances and the y-axis is the percent of test
instances correctly classified. The boxes represent the empirical means, the y-bars
the 95% confidence interval around those means and the curve is the value predicted
by Equation 3.


≤ P(f_j) ≤ 0.3). In this simulation, the probability
that a training example is a positive example is 40%
(i.e., P = 0.4). The expected accuracy of the wholist
algorithm, as given by Equation 3 (under the
conditions  used  to  generate  the  training  data)  is 
plotted  along  with  the  results  obtained  by  running 
the  program. 
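An experiment of the kind described here can be sketched as a small Monte Carlo simulation. This is an illustrative reconstruction, not the authors' program: it tracks only the irrelevant features (relevant features are present in every positive example, so they are never dropped and never cause a misclassification), and the function names, seed, and default run counts are assumptions.

```python
import random


def simulate_wholist(n_train, p_pos, p_features, n_test=100, rng=None):
    """Train wholist on n_train random examples, then return the fraction of
    n_test random positive test examples that are classified correctly."""
    rng = rng or random.Random()
    in_hypothesis = [True] * len(p_features)
    for _ in range(n_train):
        if rng.random() < p_pos:                  # a positive training example
            for j, p in enumerate(p_features):
                if rng.random() >= p:             # feature j is absent: drop it
                    in_hypothesis[j] = False
    correct = 0
    for _ in range(n_test):
        present = [rng.random() < p for p in p_features]
        # Correct when no retained irrelevant feature is missing from the test example.
        correct += all(pres or not kept for pres, kept in zip(present, in_hypothesis))
    return correct / n_test


def mean_accuracy(n_train, p_pos, p_features, runs=100, seed=0):
    """Average accuracy over independent runs, as plotted against N."""
    rng = random.Random(seed)
    total = sum(simulate_wholist(n_train, p_pos, p_features, rng=rng)
                for _ in range(runs))
    return total / runs


features = [0.05, 0.10, 0.20, 0.25, 0.30]         # example values in the stated range
for n in (0, 10, 20, 40):
    print(n, mean_accuracy(n, 0.4, features))
```

The inputs to the simulator are exactly P, I and the P(f_j), which is the information the average case model requires.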

In the following sections, we will illustrate how
the framework we have developed for average case
analysis of knowledge-free conjunctive learning
algorithms can be used to gain insight on concept
formation in the presence of background knowledge.

First, we describe a performance task and three
learning  algorithms  that  can  be  used  to  acquire  the 
knowledge  necessary  to  accomplish  that  task.  Next, 
we  present  an  average  case  analysis  of  each 
algorithm.  Finally,  in  order  to  evaluate  the  average 
case  analysis,  we  compare  the  predicted  and 
observed  accuracy  of  the  algorithms  on  this  task. 

2.2  Learning  a  domain  theory. 

The  learning  and  performance  tasks  to  be  analyzed 
differ  from  those  commonly  studied  in  machine 
learning.  The  difference  is  necessitated  by  the  fact 
that  we  are  interested  in  analyzing  the  use  and 
learning  of  a  domain  theory  for  explanation-based 
learning (DeJong & Mooney, 1986; Mitchell,
Keller, & Kedar-Cabelli, 1986). Learning a
domain theory requires learning multiple concepts
(one for each rule in the domain theory). The
performance task is to infer if a predicate, p_j, is true
in a world W given that a predicate p_i is true. We


will assume that the world can be represented as a
set of binary features x_1, x_2, ..., x_n. In this paper,
variables starting with W will be used to refer to
specific instances of worlds; variables starting with
G will be used to refer to general descriptions of a
class of worlds. We will assume that each G may
be represented as a conjunction of a subset of the
features used to describe specific instances of
worlds.

Training examples are represented as specific
instances of inference rules of the form³ p_i(W) ->
p_j(W). The knowledge acquired by the learning
system and used by the performance system is
represented as inference rules of the form p_i(G) ->
p_j(G). These rules may be read as “if p_i is true in
a world G, then p_j is true in G.” G represents the set
of conditions under which p_i implies p_j. Inference
rules of this form may be learned by generalizing all
of the individual worlds in which p_i implies p_j. A
collection of such inference rules, learned from a
variety of examples, may serve as the domain theory
for explanation-based learning.

In this paper, we consider the case in which
there are two possible means to determine if a
predicate c is true in world W when predicate a is
true. The first is to acquire a rule (a(G_AC) -> c(G_AC))
that allows c to be inferred directly. The second is
to acquire two rules (a(G_AB) -> b(G_AB) and b(G_BC) ->
c(G_BC)) and allow the performance system to
chain rules to infer b from a, and then infer c
from b. Note that G_AC can also be represented as G_AB
∧ G_BC. For example, a might represent striking an
object, b might represent the object breaking and c
might represent the owner of the object getting
angry. The goal of the learning is to be able to
predict when a(G_AC) -> c(G_AC): “If a person strikes
an expensive fragile object, then the owner of the
object will get angry.” Two other rules may help in
the prediction: “If a fragile object is struck, the
object will break” and “If an expensive object
breaks, the owner will be angry.” Here, G_AC refers to
the conditions “expensive and fragile”, G_AB is the
condition “fragile” and G_BC is the condition
“expensive”.

3. One possible situation in which training
examples of this kind occur is in the induction of causal
rules. In this case, a training example of the form p_i(W) -> p_j(W)
may represent the situation in which an action p_i was
observed to occur in W and a teacher indicates that p_j is
an effect of p_i.

Three  distinct  groups  of  training  examples  are 
intermixed  and  presented  incrementally  to  the 
learning  system: 

a(W_AC) -> c(W_AC)

a(W_AB) -> b(W_AB)

b(W_BC) -> c(W_BC)

where W_AC, W_AB and W_BC are the sets of features of

training  examples.  We  will  refer  to  the  first  type  of 
training  examples  as  performance  examples  since 
these  examples  will  permit  the  learning  system  to 
learn  a  rule  that  directly  enables  the  performance 
task.  We  will  refer  to  the  latter  two  types  of 
examples  as  foundational  examples  because  these 
examples  do  not  allow  the  problem  to  be  solved 
directly but provide a foundation for inferring when
c is true.⁴ We will assume that G_AC, G_AB and G_BC can
be represented as pure conjunctive concepts. In this
case, one means of learning G_AC, G_AB, and G_BC is to find
the maximally specific conjunction of all examples
of W_AC, W_AB and W_BC, respectively.

We  will  consider  three  related  learning  methods 
that  can  be  used  to  acquire  the  knowledge  for  this 
performance  task: 

•  wholist:  The  wholist  method  can  be  used  to 
learn G_AC from performance examples. Note that
this  method  ignores  the  foundational  examples. 


4.  Note  that  the  classification  of  a  training 
example  as  a  performance  or  foundational  example  is 
with  respect  to  a  specific  performance  task. 


• chaining: The wholist method can be used to
learn G_AB and G_BC from foundational examples
and c can be inferred from a(G_AB) -> b(G_AB)
and b(G_BC) -> c(G_BC). Note that this method
ignores the performance examples. This method
can be viewed as learning the domain theory for
explanation-based learning. However, it is
irrelevant to the accuracy of the results of the
learning whether the rule (a(G_AC) -> c(G_AC)) is
cached by EBL (and updated whenever the
domain theory is changed) or the performance
task is solved by chaining.

• IOSC-TM (Sarrett & Pazzani, 1989a): The
result of learning G_AC from performance
examples, and the result obtained by learning
G_AB and G_BC from foundational examples can be
combined to form a composite hypothesis. This
is the technique used by the integrated one-sided
conjunctive learning algorithm with truth
maintenance⁵ (IOSC-TM). If a feature is missing
from either the hypothesis found by
chaining or the hypothesis learned for G_AC, it is
not included in the composite hypothesis. It is
possible to combine the result of chaining and
the result of learning G_AC empirically because
each hypothesis only makes one-sided errors.
The composite hypothesis formed by IOSC-TM
is similar to the S set of the version space
merging algorithm (Hirsh, 1989). However,
unlike version-space merging, IOSC-TM learns
the domain theory from foundational examples.
Unlike the previous two algorithms, IOSC-TM
can update its hypothesis when presented with
performance or foundational examples.

Table 2 shows an example of the hypothesis
produced by each of these three algorithms when run
on the same training examples. Note that the
hypothesis formed by IOSC-TM contains only those
features that are in the hypothesis formed by wholist
on performance examples, and in the hypothesis
formed by chaining together foundational rules.
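The relationship among the three hypotheses can be sketched with set operations: each rule condition is the intersection of the worlds it was learned from, chaining conjoins the two foundational conditions, and the composite keeps only features found in both the wholist and the chained hypothesis. The worlds below are hypothetical and are not the examples of Table 2; the feature names simply echo the striking and breaking illustration, and the sketch assumes each rule has seen at least one example.

```python
def generalize(worlds):
    """Maximally specific conjunction: features shared by every example world.
    Assumes at least one world has been observed."""
    return set.intersection(*(set(w) for w in worlds))


def chained(g_ab, g_bc):
    """Condition under which a -> c follows by chaining a -> b and b -> c."""
    return g_ab | g_bc


def composite(g_ac, g_ab, g_bc):
    """IOSC-TM style composite: keep a feature only if both the wholist
    hypothesis and the chained hypothesis contain it."""
    return g_ac & chained(g_ab, g_bc)


# Hypothetical worlds, each given as the set of features that are true in it.
performance = [{"fragile", "expensive", "red", "heavy"},      # a(W) -> c(W)
               {"fragile", "expensive", "red"}]
a_to_b = [{"fragile", "heavy", "indoors"},                    # a(W) -> b(W)
          {"fragile", "red", "indoors"}]
b_to_c = [{"expensive", "heavy", "fragile"},                  # b(W) -> c(W)
          {"expensive", "red"}]

g_ac, g_ab, g_bc = generalize(performance), generalize(a_to_b), generalize(b_to_c)
print(sorted(g_ac), sorted(chained(g_ab, g_bc)), sorted(composite(g_ac, g_ab, g_bc)))
```

In this example the wholist hypothesis still contains the irrelevant feature "red" and the chained hypothesis still contains "indoors", while the composite has eliminated both.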

The  analysis  of  the  wholist  algorithm  presented 
in  Section  2.1  will  apply  directly  to  this  problem. 
Since wholist does not modify its hypothesis for G_AC
when  presented  with  foundational  examples,  P  can 


5. Truth maintenance implies that the composite
hypothesis is updated immediately when a foundational
example changes the hypothesis of a foundational rule.
IOSC without truth maintenance waits until a
performance example is misclassified before updating the
composite hypothesis.

Table 2. Hypotheses produced by the three algorithms


Training Examples

a(x_1=1, x_2=1, x_3=1, x_4=1, x_5=1) -> c(x_1=1, x_2=1, x_3=1, x_4=1, x_5=1)

a(x_1=1, x_2=0, x_3=1, x_4=1, x_5=1) -> b(x_1=1, x_2=0, x_3=1, x_4=1, x_5=1)

b(Xj- 

1,X2= 

1  ,  X3“ 

l/Xj- 

0) 

c  (Xj- 

1  /  Xj* 

1  ,  Xj" 

1  f  Xj" 

0) 

a(x_1=1, x_2=1, x_3=0, x_4=0, x_5=1) -> b(x_1=1, x_2=1, x_3=0, x_4=0, x_5=1)

b(x_1=1, x_2=1, x_3=1, x_4=0, x_5=1) -> c(x_1=1, x_2=1, x_3=1, x_4=0, x_5=1)

a(x_1=1, x_2=1, x_3=1, x_4=0, x_5=0) -> c(x_1=1, x_2=1, x_3=1, x_4=0, x_5=0)

Hypotheses

wholist: a(x_1=1, x_2=1, x_3=1) -> c(x_1=1, x_2=1, x_3=1)

chaining: a(x_1=1, x_5=1) -> b(x_1=1, x_5=1)

b(x_1=1, x_2=1, x_3=1) -> c(x_1=1, x_2=1, x_3=1)

(implicit) a(x_1=1, x_2=1, x_3=1, x_5=1) -> c(x_1=1, x_2=1, x_3=1, x_5=1)

IOSC-TM: a(x_1=1, x_5=1) -> b(x_1=1, x_5=1)

b(x_1=1, x_2=1, x_3=1) -> c(x_1=1, x_2=1, x_3=1)

a  (Xj-  l,Xj-  1)  ->  c(Xj-  l.x^-  1) 


be viewed as the probability that a training example
is a positive performance example and 1-P can be
viewed as the probability that a training example is a
foundational example. Note that we are interested in
predicting the accuracy of the learning algorithms as
a function of the total number of examples (both
foundational and performance).

2.3  Average  case  analysis  of  using  chaining 

Chaining a(G_AB) -> b(G_AB) and b(G_BC) -> c(G_BC)
requires using the wholist algorithm to learn two
conditions (G_AB and G_BC). A positive test example
will be correctly classified (i.e., for a given W_AC,
determining if a(W_AC) -> c(W_AC)) if both G_AB and G_BC
do not contain any irrelevant feature that is not
present  in  the  test  example.  Some  new  notation  is 
necessary  to  express  the  probability  that  a  positive 
test  example  is  classified  correctly  by  chaining: 


• P_AC, P_AB, and P_BC are the probabilities of
drawing a positive training example from
a(W_AC) -> c(W_AC), a(W_AB) -> b(W_AB), and b(W_BC)
-> c(W_BC), respectively. We will
only consider the case that P_AC + P_AB + P_BC =
1 (i.e., there are no negative training examples).

• P_AC(f_j), P_AB(f_j), and P_BC(f_j) are the probabilities
that irrelevant feature j is present in a positive
training example from a(W_AC) -> c(W_AC), a(W_AB)
-> b(W_AB), and b(W_BC) -> c(W_BC), respectively.

If i_1 positive training examples of a(W_AB) ->
b(W_AB), and i_2 positive training examples of b(W_BC)
-> c(W_BC) have been seen out of N total training
examples, the probability that irrelevant feature j
remains in the hypothesis is:

$$remain_{chaining}(f_j, i_1, i_2) = 1 - \left(\left(1 - P_{AB}(f_j)^{i_1}\right) \cdot \left(1 - P_{BC}(f_j)^{i_2}\right)\right) \qquad [4]$$

Therefore, the probability that feature j does
not cause a positive test example to be misclassified
is given by nm_chaining(f_j, i_1, i_2):

$$nm_{chaining}(f_j, i_1, i_2) = 1 - remain_{chaining}(f_j, i_1, i_2) \cdot (1 - P_{AC}(f_j)) \qquad [5]$$

If all irrelevant features are independent, then
the probability that no irrelevant feature will cause a
positive test example to be misclassified is:

$$\prod_{j=1}^{I} nm_{chaining}(f_j, i_1, i_2) \qquad [6]$$

Finally, in order to predict the accuracy of the
hypothesis produced by chaining after N training
examples, it is necessary to take into consideration
the probability that there are exactly i_0 positive
training examples of a(W_AC) -> c(W_AC), i_1 positive
training examples of a(W_AB) -> b(W_AB), and i_2
positive training examples of b(W_BC) -> c(W_BC) for
each value of i_0, i_1 and i_2 from 0 to N. Note that
because there are no negative training examples, i_2
is equal to N - (i_0 + i_1). The multinomial formula is
used to weight the value of nm_chaining(f_j, i_1, i_2) by the
probability that various values of i_0, i_1 and i_2 occur.

Therefore, the accuracy of the hypothesis produced
by chaining can be given by Equation 7.

$$\sum_{i_0=0}^{N} \sum_{i_1=0}^{N-i_0} \left[ \frac{N!}{i_0!\, i_1!\, (N-(i_0+i_1))!} \, P_{AC}^{i_0} \, P_{AB}^{i_1} \, P_{BC}^{N-(i_0+i_1)} \cdot \prod_{j=1}^{I} nm_{chaining}(f_j, i_1, N-(i_0+i_1)) \right] \qquad [7]$$
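The chaining prediction can be computed by summing over the possible counts of each example type. The sketch below is a minimal illustration under the assumptions stated in the text (independent irrelevant features, no negative examples, and test examples drawn from the performance distribution, so that P_AC(f_j) governs whether a feature is present at test time). All names are illustrative.

```python
from math import factorial, prod


def remain_chaining(p_ab_f, p_bc_f, i1, i2):
    """Feature survives in the chained hypothesis if it survives in G_AB or G_BC."""
    return 1 - (1 - p_ab_f ** i1) * (1 - p_bc_f ** i2)


def nm_chaining(p_ac_f, p_ab_f, p_bc_f, i1, i2):
    """Probability that the feature does not cause a misclassification."""
    return 1 - remain_chaining(p_ab_f, p_bc_f, i1, i2) * (1 - p_ac_f)


def multinomial(n, i0, i1, p_ac, p_ab, p_bc):
    """Probability of i0, i1 and n - i0 - i1 examples of the three types."""
    i2 = n - i0 - i1
    coeff = factorial(n) // (factorial(i0) * factorial(i1) * factorial(i2))
    return coeff * p_ac ** i0 * p_ab ** i1 * p_bc ** i2


def chaining_accuracy(n, p_ac, p_ab, p_bc, features):
    """features: list of (P_AC(f_j), P_AB(f_j), P_BC(f_j)) triples."""
    total = 0.0
    for i0 in range(n + 1):
        for i1 in range(n - i0 + 1):
            i2 = n - i0 - i1
            weight = multinomial(n, i0, i1, p_ac, p_ab, p_bc)
            total += weight * prod(nm_chaining(fa, fb, fc, i1, i2)
                                   for fa, fb, fc in features)
    return total


print(chaining_accuracy(20, 0.4, 0.3, 0.3, [(0.2, 0.5, 0.3), (0.6, 0.1, 0.7)]))
```

Performance examples (counted by i_0) only consume training examples here; they never change the chained hypothesis, which is why the prediction depends on i_0 solely through i_2 = N - (i_0 + i_1).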

In  Sarrett  &  Pazzani  (1989b),  we  consider  the 
general  case  in  which  the  inference  chain  is  of  any 
given  length  and  prove  the  general  forms  of  the 
equations  in  this  paper.  In  Section  2.5,  we  illustrate 
how  well  Equation  7  models  the  accuracy  of  the 
hypothesis  produced  by  the  chaining  algorithm. 

2.4 Average case analysis of IOSC-TM

The IOSC-TM algorithm combines the hypothesis
formed by chaining a(G_AB) -> b(G_AB) and b(G_BC) ->
c(G_BC) and the hypothesis produced by wholist
learning a(G_AC) -> c(G_AC). A positive test example
will be incorrectly classified if it does not contain
an irrelevant feature that meets both of the following
conditions:

• Either G_AB or G_BC contains the irrelevant feature.

• G_AC contains the irrelevant feature.

If i_0 positive training examples of a(W_AC) ->
c(W_AC), i_1 positive training examples of a(W_AB) ->
b(W_AB), and i_2 positive training examples of b(W_BC)
-> c(W_BC) have been seen out of N total training
examples, the probability that irrelevant feature j
remains in the hypothesis is:

$$remain_{IOSC}(f_j, i_0, i_1, i_2) = P_{AC}(f_j)^{i_0} \cdot \left(1 - \left(\left(1 - P_{AB}(f_j)^{i_1}\right) \cdot \left(1 - P_{BC}(f_j)^{i_2}\right)\right)\right) \qquad [8]$$

Therefore, the probability that feature j does not
cause a positive test example to be misclassified by
IOSC-TM is given by nm_IOSC(f_j, i_0, i_1, i_2):

$$nm_{IOSC}(f_j, i_0, i_1, i_2) = 1 - remain_{IOSC}(f_j, i_0, i_1, i_2) \cdot (1 - P_{AC}(f_j)) \qquad [9]$$

In order to predict the accuracy of the hypothesis
produced by IOSC-TM after N training examples, it is
necessary to take into consideration the probability
that there are exactly i_0 positive training examples
of a(W_AC) -> c(W_AC), i_1 positive training examples
of a(W_AB) -> b(W_AB), and i_2 positive training
examples of b(W_BC) -> c(W_BC) for each value of i_0,
i_1 and i_2 from 0 to N. As with chaining, i_2 is
equal to N - (i_0 + i_1). Therefore, the accuracy of
the hypothesis produced by IOSC-TM can be given
by Equation 10.

$$\sum_{i_0=0}^{N} \sum_{i_1=0}^{N-i_0} \left[ \frac{N!}{i_0!\, i_1!\, (N-(i_0+i_1))!} \, P_{AC}^{i_0} \, P_{AB}^{i_1} \, P_{BC}^{N-(i_0+i_1)} \cdot \prod_{j=1}^{I} nm_{IOSC}(f_j, i_0, i_1, N-(i_0+i_1)) \right] \qquad [10]$$

2.5 Evaluation of the average case model

In order to compare the accuracy of the hypotheses
produced by the three algorithms under a variety of
conditions, we substituted various values for P_AC(f_j),
P_AB(f_j), P_BC(f_j), P_AC, P_AB, and P_BC into Equations
3, 7 and 10. In addition, we ran each of the three
algorithms on data generated according to the values
of the parameters. Figure 2 shows the predicted and
observed accuracy of the three algorithms when P_AC
is 0.4, 0.2 and 0.1. In each case, P_AB and P_BC are
(1 - P_AC)/2. The values of P_AC(f_j), P_AB(f_j) and
P_BC(f_j) were randomly assigned for
each feature from the range (0.01 to 0.80).

The theoretical values predicted by the average
case framework presented in this paper allow several
conclusions to be drawn about the three algorithms.
First, wholist converges to 100% accuracy more
quickly than chaining when there is a larger
proportion of performance examples, while chaining
converges more quickly than wholist when there is a
larger proportion of foundational examples. IOSC-TM
always achieves an accuracy greater than or
equal to the accuracy of chaining or wholist. Of
course, similar generalizations about the behavior of
these algorithms can be drawn by inspecting the
algorithms. However, the average-case learning
model is able to quantify the exact conditions under
which one algorithm will produce more accurate
results than another.
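The comparison described in this section can be sketched by evaluating all three predictions with one routine that differs only in the per-feature term. This is an illustrative sketch with hypothetical names and example parameter values; it writes the wholist prediction as a multinomial sum, which equals the binomial form of Equation 3 with P = P_AC because the wholist term depends only on i_0.

```python
from math import factorial, prod


def nm_wholist(f, i0, i1, i2):
    p_ac_f, _, _ = f
    return 1 - p_ac_f ** i0 * (1 - p_ac_f)


def nm_chaining(f, i0, i1, i2):
    p_ac_f, p_ab_f, p_bc_f = f
    remain = 1 - (1 - p_ab_f ** i1) * (1 - p_bc_f ** i2)
    return 1 - remain * (1 - p_ac_f)


def nm_iosc(f, i0, i1, i2):
    p_ac_f, p_ab_f, p_bc_f = f
    remain = p_ac_f ** i0 * (1 - (1 - p_ab_f ** i1) * (1 - p_bc_f ** i2))
    return 1 - remain * (1 - p_ac_f)


def expected_accuracy(n, p_ac, p_ab, p_bc, features, nm):
    """Multinomially weighted product of per-feature terms.
    features: list of (P_AC(f_j), P_AB(f_j), P_BC(f_j)) triples."""
    total = 0.0
    for i0 in range(n + 1):
        for i1 in range(n - i0 + 1):
            i2 = n - i0 - i1
            coeff = factorial(n) // (factorial(i0) * factorial(i1) * factorial(i2))
            weight = coeff * p_ac ** i0 * p_ab ** i1 * p_bc ** i2
            total += weight * prod(nm(f, i0, i1, i2) for f in features)
    return total


def compare(n, p_ac, features):
    p_other = (1 - p_ac) / 2          # P_AB = P_BC = (1 - P_AC) / 2
    return {name: expected_accuracy(n, p_ac, p_other, p_other, features, nm)
            for name, nm in (("wholist", nm_wholist),
                             ("chaining", nm_chaining),
                             ("iosc_tm", nm_iosc))}


FEATURES = [(0.2, 0.5, 0.3), (0.6, 0.1, 0.7), (0.4, 0.4, 0.4)]  # example values
for p_ac in (0.4, 0.2, 0.1):
    print(p_ac, compare(20, p_ac, FEATURES))
```

Since the IOSC-TM term is never smaller than either of the other two terms for any counts i_0, i_1 and i_2, its predicted accuracy is at least that of wholist and of chaining for every N.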

3  Conclusion 

In  this  paper,  we  have  presented  a  framework  for 
average  case  analysis  of  machine  learning 
algorithms. Applying the framework consists of 1)
understanding how an algorithm revises a hypothesis
for a concept, 2) calculating the probability that a
training example will be encountered that causes an
inaccurate hypothesis to be revised and 3)
calculating the effect that revising a hypothesis has on
the accuracy of the hypothesis. The framework
requires  much  more  information  about  the  training 
examples  than  the  PAC  learning  model.  The 
information  required  by  the  model  is  exactly  the 
information  required  to  generate  artificial  data  to  test 
learning  algorithms.  We  have  applied  the 
framework  to  three  different  learning  algorithms. 
We  have  verified  through  experimentation  that  the 
equations  accurately  predict  the  expected  accuracy. 
Although  we  have  currently  analyzed  only 
algorithms  for  conjunctive  concepts,  we  anticipate 
that  the  framework  will  scale  to  similar  algorithms 
using  more  complex  hypotheses.  Our  future  plans 
include  modeling  more  complex  learning  algorithms 
with more expressive concepts (e.g., k-CNF and k-DNF)
and using statistical techniques to derive the
information needed for average case analysis from
existing databases.

Figure 2: A comparison of the expected and actual accuracy of the three algorithms. The
points are the empirical means with 95% confidence intervals and the curves are the
values predicted by Equations 3, 7 and 10.

Abstract


This paper describes our initial implementation
of a domain-independent Integrated
Learning System (ILS), and one application,
which, through its own experience, discovers
how to control a telecommunications network.
ILS provides a framework for integrating several
heterogeneous learning agents, in this case
implementations of inductive, search-based and
knowledge-based learning. These agents, written
in various languages and executing on
various platforms, cooperate to improve
problem-solving performance. ILS also includes
a central controller, called The Learning
Coordinator (TLC), which manages control flow
and communication between the agents using a
high-level communication protocol. The agents
provide TLC with expert advice. TLC chooses
which suggestion to adopt and performs the
appropriate actions. At intervals, the agents can
inspect the results of TLC's actions and use
this feedback to learn, improving the value of
their future advice. At present ILS is being
extensively tested, and the initial results are
promising.

1. Introduction

This paper describes our initial implementation of a
domain-independent, distributed Integrated Learning System
(ILS). The first application of ILS determines, through
experience, how to control a telecommunications network.
This ongoing work addresses issues involved in
combining various learning paradigms, integrating different
reasoning techniques, and coordinating distributed
cooperating problem-solvers.

A wide variety of learning algorithms have been
reported in the literature. However, no one algorithm is
adequate for a wide range of problems. Our approach is
to integrate several algorithms.

iLS  provides  a  framcwoik  for  integrating  several 
heterogeneous  learning  agents,  written  in  various  lan¬ 
guages  and  executing  on  various  platforms,  that  cooperate 
to  improve  problem-solving  performance.  It  also  includes 
a  central  controller  called  The  Learning  Coordinator 
(Tlc)  which  manages  control  flow  and  communication 
among  the  agents,  using  a  high-level  communication 
protocol.  The  agents  provide  Tlc  with  expert  advice  con¬ 
cerning  the  current  problem;  Tlc  then  chooses  which  sug¬ 
gestion  to  adopt,  and  performs  the  appropriate  actions. 
The  agents  compete  by  offering  potentially  differing  ad¬ 
vice  to  Tlc,  and  cooperate  to  overcome  gaps  in  their  in¬ 
dividual  knowledge.  At  intervals,  the  agents  can  inspect 
the  results  of  the  Tlc’s  actions  and  use  this  feedback  to 
learn. As they learn, either autonomously or cooperatively,
the quality of advice given to Tlc increases, leading
to better performance.

At  present,  iLS  contains  three  learning  agents, 
NetMan,  Fbi  and  Maclearn,  and  a  domain  simulator 
called  Netsim.  Figure  1  shows  the  current  iLS  architec¬ 
ture. 

The agents are heterogeneous in that they each have a
different:

•  model  of  the  domain 

•  area  of  expertise  in  the  domain 

•  learning paradigm

Figure 1: The Integrated Learning System


In  addition,  Tlc,  each  agent,  and  the  domain  simulator 
can execute in parallel on different machines.

The  heterogeneous  approach  distinguishes  iLS  from 
Soar  [Laird  et  al  87],  which  uses  one  learning  paradigm 
(chunking) for all learning tasks, and THEO [Blythe &
Mitchell  89],  which  uses  one  representation  framework 
(frames)  for  all  learning  tasks.  Unlike  the  system 
described in [Falkenhainer & Rajamoney 88], iLS can integrate
learning agents of many different types.

Section  2  briefly  describes  the  learning  paradigms  and 
their specific realizations as agents in iLS. Section 3 discusses
the domain of experimentation, telecommunications
network traffic control, and a realistic simulator
called  Netsim.  Section  4  describes  the  iLS  communica¬ 
tion  protocols.  Section  5  discusses  the  operation  of  Tlc. 
Section  6  describes  the  inter-agent  cooperation  im¬ 
plemented  in  the  present  version  of  iLS.  Section  7  dis¬ 
cusses  the  possible  future  directions  of  this  work. 

2.  Some  Learning  Agents 

This  section  briefly  describes  three  learning  paradigms 
and their implementations in ILS: inductive (FBI), search-based
(Maclearn) and knowledge-based (NetMan).
FBI  and  Maclearn  are  written  in  Symbolics  Lisp  and  run 
on  Symbolics  Lisp  Machines.  NetMan  is  written  in 
Quintus  Prolog  and  runs  on  Sun  Microsystems  worksta¬ 
tions. 


FBI,  Maclearn  and  NetMan  are  discussed  in  more 
detail  in  [Frawley  89,  Iba  89,  Silver  90]  respectively. 

2.1.  Inductive  Learning 

FBI (Function-Based Induction), [Frawley 89], is an
extension of Quinlan’s ID3, [Quinlan 86]. FBI learns decision
trees from large numbers of examples.

Its  inputs  include: 

•  the function, f, to be approximated,

•  the set of approximators, G,

•  the domain, D, (a database) on which f and the
members of G are defined,

•  and the measure of uncertainty whose local
minima are used to define the tree. The default
for this function is the conditional entropy.

f is a finite-range function which corresponds to the goal
concept to be learned. The individual approximators in G
are functions which combine and generalize the attributes
of the database D.
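As an illustration of the default uncertainty measure, the following Python sketch computes the conditional entropy of f given each approximator over a small database and selects the minimizer. It is not FBI's own code (FBI is written in Symbolics Lisp); the toy database, the attribute names, and the helpers conditional_entropy and best_approximator are hypothetical.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(rows, f, g):
    """H(f | g) in bits, estimated over the database rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[g(row)].append(f(row))
    total = len(rows)
    h = 0.0
    for values in groups.values():
        counts = Counter(values)
        h_group = -sum((c / len(values)) * math.log2(c / len(values))
                       for c in counts.values())
        h += (len(values) / total) * h_group
    return h

def best_approximator(rows, f, approximators):
    """Name of the approximator g in G that minimizes H(f | g)."""
    return min(approximators,
               key=lambda name: conditional_entropy(rows, f, approximators[name]))

# Hypothetical database D: one row per trunk group and period.
D = [
    {"attempts": 100, "completions": 95, "hour": 9, "congested": False},
    {"attempts": 200, "completions": 120, "hour": 9, "congested": True},
    {"attempts": 150, "completions": 148, "hour": 17, "congested": False},
    {"attempts": 300, "completions": 150, "hour": 17, "congested": True},
]
f = lambda row: row["congested"]  # goal concept
G = {
    "hour": lambda row: row["hour"],
    # An approximator that combines two raw attributes into a thresholded ratio.
    "low_completion_ratio": lambda row: row["completions"] / row["attempts"] < 0.9,
}
chosen = best_approximator(D, f, G)
```

In this toy case the thresholded ratio separates the two classes completely, so it is preferred over the raw hour attribute, which leaves one bit of uncertainty.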

FBI returns a tree-structured object along with code to
evaluate  the  tree  function.  As  a  result  of  a  two-pass  pro¬ 
cedure,  the  structure  contains  no  isomorphic  subtrees. 
Thus,  branches  of  a  node  may  represent  subsets  of  values 
rather  than  individual  values.  Each  leaf  of  the  object 
encodes the associated value of f and retains pointers to
all the database entries used in its construction. Each
non-leaf node retains pointers to its inconsistent database
entries, computed as follows. First, during tree construction,
classes of inconsistencies are identified and for each
class those having the most-likely f-value on the class are
used to construct the tree. Then, the inconsistent data that
was not used is passed through the nodes and branches
down to those non-leaf nodes having no suitable branch,
where it is recorded as inconsistent.

In  addition,  by  adding  a  meta-level  field  to  the 
database  which  records  all  trees  computed  over  it,  it  is 
possible  to  update  the  trees  as  the  database  changes.  Each 
new  database  entry  is  passed  through  the  structure  of  each 
tree. If it is consistent with a tree’s approximation to f, it
is  indexed  by  some  leaf  node  of  the  tree;  otherwise,  it  is 
indexed  by  the  inconsistencies  of  some  non-leaf  node. 

Unlike ID3, which is used to build one decision tree,
FBI is used to build and maintain a sequence of trees
employing different approximators or different methods
of approximation.

•  Function-based  induction  can  iteratively  refine 
an  approximation  to  a  particular  classifier  in 
this  way:  First,  a  decision  tree  based  on  at¬ 
tributes  alone  is  computed.  Then,  the  tree  is  ex¬ 
amined  by  the  user  for  intelligibility  or  domain 
or  statistical  relevance.  Some  attributes  may  be 
removed  from  the  set  of  approximators;  new 
domain  and  context  functions  may  be  added  to 
the  approximators.  Then  the  tree  is  recom¬ 
puted,  re-examined,  etc. 

•  Consultation programs which utilize different
rulesets can be built. Various aspects of
domain  or  context  knowledge  may  be  suggested 
by  different  experts  or  domain  models.  These 
can  be  encoded  into  different  decision  trees  and 
the  statistics  of  each  and  performance  over  time 
compared. 

•  Computed  decision-tree  functions  can  be  used 
as  approximators  in  constructing  other 
decision-tree  functions. 

There  are  two  ways  in  which  FBI  makes  use  of  domain 
and  contextual  knowledge  in  constructing  approxima¬ 
tions.  First,  the  approximator  functions  can  appropriately 
combine  attributes.  For  example,  in  network  traffic  con¬ 
trol,  counts  of  call  attempts  and  call  completions  are 
recorded  for  all  trunk  groups  for  each  five-minute  period; 
but these numerical attributes are not individually important.
What is important is the classification of ratios of
completions to attempts to indicate over- or under-utilization
of given trunk groups. The use of thresholds
for  certain  ratios  of  attributes  exemplifies  how  domain 
knowledge  directs  attribute  combination;  the  situation- 
dependent  values  of  thresholds  used  exemplify  contextual 
knowledge. Second, the selection of the next best
approximator can depend on domain or context. In
straightforward statistical induction, a decision tree is
constructed  recursively  by  selecting,  at  each  level,  the  ap¬ 
proximator  which  minimizes  an  uncertainty  function,  u, 
over  the  unresolved  data.  By  allowing  the  user  to  specify 
the  selection  method,  FBI  can  override  strict  minimization 
criteria.  For  example,  domain  knowledge  might  dictate 
that the approximator g should appear above the approximator
h in any tree in which h occurs. Or contextual
information unrelated to the uncertainty calculation may
provide weights $\{a_i\}$ adjusting u to select the minimizer
of $a_i u_i$.
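The two kinds of override just described can be sketched as follows. This is only an illustration under stated assumptions: the function select_approximator, its parameters, and the example uncertainty values are hypothetical, and the weighting is taken to be a simple product of a weight and an uncertainty value per approximator.

```python
def select_approximator(uncertainty, weights=None, must_precede=None, already_used=()):
    """Pick the approximator minimizing weights[g] * uncertainty[g].

    must_precede maps h -> g, meaning g has to appear above h in the tree,
    so h is not eligible until g has been used on the current path.
    """
    weights = weights or {}
    must_precede = must_precede or {}
    candidates = [
        g for g in uncertainty
        if g not in already_used
        and (g not in must_precede or must_precede[g] in already_used)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda g: weights.get(g, 1.0) * uncertainty[g])

# Hypothetical uncertainty values over the unresolved data.
u = {"g": 0.6, "h": 0.4, "k": 0.5}

strict = select_approximator(u)                               # plain minimization
ordered = select_approximator(u, must_precede={"h": "g"})     # h blocked until g is used
after_g = select_approximator(u, must_precede={"h": "g"}, already_used=("g",))
weighted = select_approximator(u, weights={"g": 0.5})         # context favors g
```

Strict minimization would place h first; the precedence constraint defers it, and a contextual weight can promote an approximator whose raw uncertainty is not minimal.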

FBI also includes a function-discovery mode which
discovers two types of interesting functions. First, there
are classifiers defined by tree nodes which, during tree
construction, have replicated immediate subtrees.
Secondly, disjunctive concepts representing paths to maximal
subtrees replicated at different levels in the tree are
automatically defined. This extends the ongoing work in
the field on learning disjunctive normal form (DNF) concepts
using decision trees as a concept description language,
[Pagallo & Haussler 89, Pagallo 89, Matheus &
Rendell 89].

Often  the  concepts  discovered  in  this  way  are  useful 
because  they  reflect  genuine  features  of  the  domain.  In 
other  cases,  the  concepts  discovered  are  artifacts  caused 
by random patterns in the example set. FBI can ask other
agents of ILS, in particular NetMan, to assess the utility
of a discovered concept. This is discussed in Section 6.3.

2.2.  Search-based  Learning 

MACLEARN [Iba 89, Iba 88] currently performs
best-first  search  (see,  for  example,  [Nilsson  80])  in  order 
to  learn  macro-operators,  or  macros,  which  are  useful 
combinations  of  operators  that  can  be  subsequently 
treated  as  a  single  operator.  Macro-learning  is  a  form  of 
chunking  that  can  improve  search  performance  by  enlarg¬ 
ing  the  set  of  operators  available  for  the  search.  The 
availability  of  a  good  set  of  macros  will  often  drastically 
reduce  the  combinatorially  explosive  nature  of  a  search 
problem. 

Macros  reduce  search  in  two  general  ways.  The  first  is 
to  take  larger  steps  in  the  search  space.  Since  a  macro 
actually  represents  a  number  of  primitive  operators,  the 
application  of  a  single  macro  can  result  in  moving  a 
greater  distance  through  the  search  space.  A  problem  that 
may  require  hundreds  of  primitive  steps  to  solve,  may  be 
solvable  more  quickly  by  only  tens  of  applications  of 
macro  steps. 

The  second  way  is  to  provide  synergistic  combinations 
of  operators:  applying  a  set  of  operators  as  a  group  may 
be beneficial, whereas applying any operator alone may
make  things  worse.  In  such  a  case,  it  may  be  difficult  for 
a  search  to  discover  the  combination;  but  once  it  has  been 
discovered  it  is  valuable  to  remember  it  in  order  to  avoid 
having  to  duplicate  the  lengthy  search  in  subsequent 
similar  situations.  Thus  macro  learning  can  be  viewed  as 
a kind of encapsulation of experience.

However, there is a cost associated with creating macros.
Macros enable larger steps through the search space,
but  they  increase  the  number  of  possible  branches  at  each 
step. This increased branching factor will tend to slow
down  the  search.  In  order  for  macros  to  prove  beneficial, 
the  advantages  must  more  than  compensate  for  the  dis¬ 
advantages.  Thus  a  macro  learning  program  must  have  a 
means  of  deciding  which  macros  to  keep  and  which  to 
discard.  Maclearn  uses  various  heuristic  criteria  to  per¬ 
form  this  filtering  process. 
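The trade-off between shorter solutions and a larger branching factor can be made concrete with a rough estimate. The sketch below counts the nodes of a uniform, exhaustively expanded search tree; this is a simplifying assumption (Maclearn performs best-first search, and its actual filtering heuristics are not reproduced here), and the function names and numbers are hypothetical.

```python
def nodes_expanded(branching_factor, depth):
    """Nodes in a uniform search tree of the given branching factor and depth."""
    return sum(branching_factor ** level for level in range(depth + 1))

def macros_pay_off(b, d, extra_macros, d_with_macros):
    """True if adding the macros shrinks the estimated search tree.

    b, d: branching factor and solution depth with primitive operators only.
    extra_macros: number of macros added to the operator set.
    d_with_macros: solution depth once macro steps are available.
    """
    return nodes_expanded(b + extra_macros, d_with_macros) < nodes_expanded(b, d)

# Two macros that halve the solution depth: a clear win.
good = macros_pay_off(b=3, d=12, extra_macros=2, d_with_macros=6)
# Ten macros that save a single step: the extra branching dominates.
bad = macros_pay_off(b=3, d=4, extra_macros=10, d_with_macros=3)
```

Under this estimate the first macro set reduces the tree from 797161 to 19531 nodes, while the second enlarges it from 121 to 2380 nodes, which is why some means of discarding unhelpful macros is needed.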

On  complex  problems,  Maclearn  may  encounter  a 
combinatorial  explosion  as  the  search  space  of  possible 
operators  becomes  too  large.  In  such  cases,  Maclearn 
becomes bogged down in the search and may be unable to
find a satisfactory solution. Other agents within iLS can
provide  assistance  to  Maclearn  by  constraining  the 
search  and  indicating  which  part  of  the  search  space 
should  be  examined.  This  is  discussed  further  in  Section 
6.2. 

2.3. Knowledge-Based Learning

NetMan, [Silver 90], is an example of a knowledge-intensive
learning system. The algorithm used is based on
explanation-based learning (EBL), [DeJong & Mooney
86, Mitchell et al 86], modified to work with an imperfect
domain theory, [Silver 86, Silver 88].

NetMan  starts  with  a  large  amount  of  domain 
knowledge,  expressed  as  rules.  However,  the  knowledge 
need  be  neither  complete  nor  totally  accurate.  As  a  result, 
NetMan  can  make  mistakes.  (Human  experts  suffer 
from  the  same  limitation,  of  course.)  In  EBL  terms,  the 
domain  theory  is  incomplete  and  computationally  intract¬ 
able,  and  so  NetMan  uses  a  heuristic  approximation. 

NetMan  learns  four  major  types  of  information  from 
experience: 

1.  Stored caches: NetMan stores as a macro the
sequence of rule firings that led to advice that
worked.

2.  Support  List:  The  support  list  indicates  how 
successful  or  unsuccessful  a  particular  action 
proved  to  be.  Those  that  have  proved  valuable 
in  the  past  are  more  likely  to  be  used  in  the 
future. 

3.  Possible  Bugs:  When  an  action  fails  to 
achieve  the  expected  effect,  NetMan  can 
classify  the  cause  and  severity  of  this  failure. 
This  information  is  stored  and  will  affect  fu¬ 
ture  use  of  the  action. 

4.  Plans:  A  plan  consists  of  a  sequence  of  actions 
that  have  proved  useful,  together  with  the  ex¬ 
pected  effect  of  each  action. 

When  actions  have  unexpected  effects,  NetMan  at¬ 
tempts  to  explain  the  cause  of  the  unpredicted  behavior. 
This  analysis  allows  NetMan  to  discover  that  sometimes 
an action will fail due to a bug. NetMan is able to
classify the type of bug, and associate this type with the
action.  Note  that  the  action  may  often  work  successfully; 
bugs  may  occur  only  in  certain  situations. 

Ideally,  NetMan  would  be  able  to  accurately  distin¬ 
guish  between  situations  in  which  an  action  will  be  suc¬ 
cessful  and  those  in  which  it  will  fail  with  a  bug.  Unfor¬ 
tunately,  the  computation  involved,  and  the  stochastic  na¬ 
ture  of  the  domain,  make  this  impossible  to  do  precisely. 
Instead, NetMan heuristically differentiates the cases by
calling  on  another  component  of  iLS,  FBI.  Section  6.1 
describes  this  interaction. 

3.  The  Problem-Solving  and  Learning  Tasks 

Consider  a  system  or  process  characterized  by  a  time- 
dependent  internal  state,  S,  whose  quality  or  merit  at  any 
given  time  is  described  by  an  evaluation  function  e  with 
values  in  the  interval  [0,1].  The  system  responds  with  a 
time  lag  to  service  demands  and  to  whatever  control  has 
been imposed on it. The system goal is to operate with its
evaluation at or near 1. In complex systems, subject to
widely varying demand and the possibility of subsystem
failure, this can be difficult. At certain discrete times, $t_i$,
an external controller imposes one of N controls,
$A_1, \ldots, A_N$, in order to increase the value of $e(S(t))$ at subsequent
times. One way to measure the effectiveness of
the choice of a control at time $t_i$ is to compare the states
just before and sometime after the control action. For
example, with $e = e(S(t_i))$ and $e' = e(S(t_i + \Delta))$, a simple
function such as $f(e, e') = (e' - e)/(1 - e)$ may suffice. The
number of control options, N, is one measure of the complexity
of system control.
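The effectiveness measure above can be read as the fraction of the remaining possible improvement that a control achieved. A minimal Python sketch follows; the function name is hypothetical, and the handling of e = 1 (where the formula is undefined) is an assumption, since the text does not address that case.

```python
def effectiveness(e_before, e_after):
    """f(e, e') = (e' - e) / (1 - e), with both evaluations in [0, 1].

    Positive values mean the state improved; 1.0 means the evaluation
    reached its ideal value; negative values mean the state got worse.
    """
    if not (0.0 <= e_before <= 1.0 and 0.0 <= e_after <= 1.0):
        raise ValueError("evaluations must lie in [0, 1]")
    if e_before == 1.0:
        # Assumption: the measure is undefined when there is no room to improve.
        raise ValueError("effectiveness is undefined when e = 1")
    return (e_after - e_before) / (1.0 - e_before)

# A control that raises the evaluation from 0.80 to 0.95 recovers
# three quarters of the shortfall.
gain = effectiveness(0.80, 0.95)
# A control after which the evaluation drops from 0.80 to 0.70 scores negatively.
loss = effectiveness(0.80, 0.70)
```

Normalizing by 1 - e means that the same absolute gain counts for more when the system was already close to its goal.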

The  problem-solving  role  of  a  controlling  agent  is  to 
observe $e(S(t))$ or $S(t)$ over time and, at times of its own
choosing,  to  actuate  new  choices  of  control.  In  the  ab¬ 
sence  of  an  adequate  model  of  the  system,  it  may  not  be 
possible  to  assess  how  well  a  problem-solver  is  doing. 
But,  by  presenting  the  same  or  similar  situations  to  two 
competing  controllers,  it  is  possible  to  determine  which  is 
more  effective.  The  learning  role  of  a  controlling  agent  is 
to  improve  based  on  experience;  that  is,  the  self-adapted 
agent  is  to  be  more  effective  than  the  agent  in  its  original 
state. 

3.1. Application Domain: Network Traffic Control

The  dynamic  system  control  problem  used  as  the  basis 
for  iLS  experimentation  is  the  control  of  traffic  in  a 
circuit-switched  telephone  network.  Network  state  is 
defined  by  a  considerable  volume  of  data  regarding  call 
placement and switch and trunk-group usage provided on
a  minute-by-minute  basis.  The  overall  network  evalua¬ 
tion  function  used  here  is  the  percentage  of  attempted 
calls  that  are  successfully  completed,  averaged  over  the 
most  recent  five-minute  interval.  Generally,  human  net¬ 
work  traffic  managers  strive  to  keep  this  value  above  99% 
but there are many types of situations in which a considerable
percentage of calls fail and intervention is required.

The task is difficult, partly because of the huge amount
of  data  produced  and  its  time-varying  nature.  There  is 
some reliance on preplans: standard procedures for predictable
occurrences, such as Mother’s Day, which is the
busiest calling day of the year, or for special contests that
radio stations occasionally offer (“We will give two free
tickets for tomorrow’s show to the first ten callers with
the correct answer...”). However, unpredictable
problems  arise,  and  they  cause  the  major  difficulty.  One 
example  is  the  partial  or  total  failure  of  a  network  element 
(a  trunk  group  or  a  switch)  which  may  cause  some  un¬ 
avoidable  denial  of  service  to  some  users;  however,  effec¬ 
tive  traffic  management  can  greatly  improve  the  situation, 
for example, by rerouting traffic around the failed element.

Many  other  Artificial  Intelligence  techniques  have 
been applied to Network Traffic Management, including
Case-Based Reasoning [Kopeikina et al 88], Distributed
AI [Adler et al 89, Brandau & Weihmayer 89] and traditional
Expert Systems (e.g. [Kosieniak et al 88]). These
approaches  suffer  from  the  inflexibility  of  all  non- 
learning  programs:  an  inability  to  learn  from  experience 
and  thereby  to  improve  their  control  of  the  network. 

3.2.  The  Simulator 

The performance module for ILS experiments is a network
simulator called Netsim, [Frawley et al 88].
Netsim  conducts  a  fine-grained  simulation  of  the  call 
placement  process  in  a  network  of  end-offices  and  tandem 
switches  and  implements  a  set  of  controls  applicable  to 
switches and trunk groups. One property of this domain
simulation is its complexity. In ongoing experiments on a
ten-switch, sixteen-trunk network there are always at least
3000 individual “legal moves”. Moreover, in many
cases  it  is  reasonable  to  impose  several  controls  simul¬ 
taneously,  increasing  the  number  of  allowable  control  op¬ 
tions  considerably. 
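To give a feel for how quickly the option count grows, the sketch below counts the ways of imposing up to k of the individual moves at once. It assumes, purely for illustration, that every combination of distinct moves is itself legal, so the result is an upper bound rather than a figure taken from Netsim; the function name is hypothetical.

```python
import math

def control_options(single_moves, max_simultaneous):
    """Upper bound on options when up to max_simultaneous moves are combined."""
    return sum(math.comb(single_moves, k) for k in range(1, max_simultaneous + 1))

one_at_a_time = control_options(3000, 1)
up_to_two = control_options(3000, 2)
```

With 3000 individual moves, allowing pairs already yields roughly 4.5 million candidate options under this assumption.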

4.  The  ILS  Protocol 

ILS  is  completely  distributed.  Within  iLS,  agents  inter¬ 
act  via  TCP/IP  streams  using  a  communication  protocol 
consisting  of  the  six  message  (request)  types  enumerated 
below. Any agent may access any other agent using ILS
language  primitives  or  any  other  calls  understood  by  the 
target  agent. 

1.  The INITIALIZE message is a request for
an agent to initialize itself and set up communication
with other agents. This message
includes information specifying the current
host machine for each of the other agents. The
response to this message is simply an acknowledgment
that system and communication
initialization is complete.

2.  The  ADVISE  message  is  a  request  for  advice 
on what to do in the current state of the external
domain to be controlled. An agent
responds to this message by examining the
current  state  and  deciding  what  actions  would 
be  best  for  improving  this  state.  A  vote  (in  the 
range  1-5)  accompanies  each  recommen¬ 
dation,  indicating  how  beneficial  the  agent 
predicts  this  advice  is  likely  to  be.  If  the  agent 
is  unable  to  come  up  with  advice  that  im¬ 
proves  the  situation,  it  may  respond  with  the 
reply UNKNOWN.

3.  The  CRITIQUE  message  is  a  request  for  an 
agent  to  provide  its  opinion  of  advice  offered 
by  other  agents.  The  specific  pieces  of  advice 
other  agents  have  proposed  are  passed  along 
as  part  of  this  message.  The  response  to  this 
request  is  a  set  of  votes  (again  in  the  range 
1-5)  reflecting  the  value  of  each  piece  of  ad¬ 
vice  as  viewed  by  the  responding  agent.  A 
vote  of  UNKNOWN  is  given  for  those  pieces  of 
advice  that  an  agent  is  not  able  to  meaning¬ 
fully  critique. 

4.  The  DATA-AVAILABLE  message  is  used  to 
notify  agents  which  of  the  last  suggested  sets 
of  actions  were  taken  and  that  a  sufficient 
period  of  time  has  passed  since  they  were 
taken  to  allow  domain  data  to  reflect  those  ac¬ 
tions.  In  response  to  this  message,  agents  can 
gather  feedback  from  the  domain  in  order  to 
learn.¹ Finally, they send back an acknowledgment
that they have finished processing
the current data.

5.  The  RECORD  message  is  sent  to  agents  which 
maintain  their  own  databases  of  past  cases. 
One  agent  asks  another  to  record  additional 
information  about  a  given  case  for  later  recall 
or  processing.  For  example,  this  primitive  is 
used by NetMan to select specific cases for
processing by FBI. If the message recipient
does  not  have  a  recording  facility,  it  returns 
UNKNOWN. 

6.  The CLASSIFY message is a request for an
agent  to  classify  the  current  state  of  the  exter¬ 
nal  domain  with  regard  to  a  particular  predi¬ 
cate  used  in  a  previous  RECORD  message.  For 
example, NetMan uses this message to ask
FBI  to  classify  the  current  state  using  a  tree 
constructed  from  states  that  have  been 
RECORDed. 
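The six request types can be summarized in a small data model. The sketch below is an illustration only: the class names, the payload layout, the "ACK" token, and the behavior of MinimalAgent are hypothetical, the wire format used over TCP/IP is not shown, and returning UNKNOWN for CLASSIFY is an assumption (the text specifies UNKNOWN only for ADVISE, CRITIQUE and RECORD).

```python
from dataclasses import dataclass, field
from enum import Enum

class MessageType(Enum):
    INITIALIZE = "INITIALIZE"
    ADVISE = "ADVISE"
    CRITIQUE = "CRITIQUE"
    DATA_AVAILABLE = "DATA-AVAILABLE"
    RECORD = "RECORD"
    CLASSIFY = "CLASSIFY"

UNKNOWN = "UNKNOWN"

@dataclass
class Message:
    kind: MessageType
    payload: dict = field(default_factory=dict)

@dataclass
class Recommendation:
    actions: tuple
    vote: int  # predicted benefit, in the range 1-5

    def __post_init__(self):
        if not 1 <= self.vote <= 5:
            raise ValueError("vote must be in the range 1-5")

class MinimalAgent:
    """An agent with no advice to give and no recording facility."""

    def handle(self, message):
        if message.kind in (MessageType.INITIALIZE, MessageType.DATA_AVAILABLE):
            return "ACK"  # acknowledgment once processing is finished
        if message.kind is MessageType.CRITIQUE:
            # One vote per piece of advice; this agent cannot critique any.
            return {advice: UNKNOWN for advice in message.payload.get("advice", [])}
        # ADVISE, RECORD and CLASSIFY: nothing to offer.
        return UNKNOWN

agent = MinimalAgent()
ack = agent.handle(Message(MessageType.INITIALIZE, {"hosts": {}}))
critique = agent.handle(Message(MessageType.CRITIQUE, {"advice": ["a1", "a2"]}))
```

A real agent would replace the UNKNOWN branches with its own advice, critique, recording and classification logic.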


¹The technique used to gather data from the domain is agent-specific.

5. The Learning Coordinator

In  its  current  state,  the  iLS  relies  on  a  Learning  Coor¬ 
dinator  (Tlc)  to  manage  the  control  flow  between  the 
agents and the simulated domain. The basic loop,
repeated  every  five  simulated  minutes,  is  as  follows: 

1.  Tlc  asks  the  agents  to  propose  actions  to  con¬ 
trol  the  network.  Each  agent  returns  a  list  of 
possible  actions,  and  associates  with  each  ac¬ 
tion  a  vote,  a  number  between  1  and  5,  indicat¬ 
ing  the  agent’s  perception  of  the  value  of  that 
action. 

2.  Tlc  then  asks  the  agents  to  critique  the 
proposals  of  the  other  agents.  Each  agent 
returns  a  vote  for  each  proposal. 

3.  Now  Tlc  has  a  list  of  proposed  actions,  and 
each  action  has,  at  present,  a  set  of  up  to  three 
votes associated with it. Tlc has to choose
between these proposals. There are several
possible techniques that can be used for this
selection process, some of which are described
below.  As  currently  implemented,  Tlc 
averages  the  votes  for  each  proposal  and  ex¬ 
ecutes  the  action  with  the  highest  average 
score. 

4.  Five  simulated  minutes  later  new  switch  statis¬ 
tics  are  produced.  The  agents  inspect  these 
statistics  to  obtain  feedback  on  the  success  or 
failure of the chosen action, allowing them to
learn  appropriately  to  affect  their  future 
problem-solving  performance. 
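Step 3 of the loop, as currently implemented, can be sketched in a few lines. The data layout, the function name choose_action, and the example action names are hypothetical; UNKNOWN critiques are represented by None and discarded, as described in Section 5.2.

```python
from statistics import mean

UNKNOWN = None

def choose_action(proposals, critiques):
    """Return (action, average vote) for the proposal with the highest average.

    proposals: {agent: [(action, vote), ...]}
    critiques: {agent: {action: vote or UNKNOWN}}
    """
    votes = {}
    for advice in proposals.values():
        for action, vote in advice:
            votes.setdefault(action, []).append(vote)
    for opinions in critiques.values():
        for action, vote in opinions.items():
            if vote is not UNKNOWN and action in votes:
                votes[action].append(vote)
    if not votes:
        return None
    best = max(votes, key=lambda action: mean(votes[action]))
    return best, mean(votes[best])

# Hypothetical round: two agents propose, all three critique.
proposals = {
    "NetMan": [("reroute A->B", 4)],
    "FBI": [("cancel reroute C", 5)],
    "Maclearn": [],
}
critiques = {
    "NetMan": {"cancel reroute C": 2},
    "FBI": {"reroute A->B": 4},
    "Maclearn": {"reroute A->B": 5, "cancel reroute C": UNKNOWN},
}
action, score = choose_action(proposals, critiques)
```

Here the proposal with the highest raw vote loses once the critiques are averaged in, which is the effect the critiquing step is meant to have.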

The  following  sections  discuss  the  individual  steps  in 
more  detail,  using  Netsim  as  the  domain  simulator. 

5.1.  Proposing  Controls 

The  agents  individually  obtain  network  statistics 
directly  from  Netsim  via  TCP/IP  streams.  Each  agent 
examines  the  statistics  and  attempts  to  produce  an  ordered 
list of actions that the agent believes will improve the
network  state.  An  action  may  remove  some  controls  al¬ 
ready  in  place  and/or  impose  some  new  controls. 

Alternatively,  an  agent  can  suggest  taking  no  action,  if 
it  believes  that  the  current  state  of  the  network  is  satis¬ 
factory,  or  it  can  indicate  that  it  does  not  know  what  to  do. 
In  the  last  case,  no  further  processing  is  performed. 

No  agent  will  propose  actions  that  appear  to  make  the 
situation  worse.  If  an  agent  cannot  find  an  action  that  will 
improve a bad network state, it will indicate that it does
not  know  what  to  do. 

Each  agent  must  also  calculate  a  vote  for  each 
proposed  action;  in  other  words,  determine  how  much  it 
believes  in  the  action  that  it  is  proposing.  In  the  current 
implementation, each agent calculates the vote in a different
way, described below. The architecture does not
require  that  the  agents  have  a  uniform  semantics  for 
voting. However, each agent should be internally consistent;
an agent should expect that actions it scores highly
will  perform  better  than  those  actions  to  which  it  gives  a 
lower  score. 

Maclearn  calculates  the  vote  using  an  internal 
simplified  simulator,  called  FLosim.  It  compares  the 
simulated  outcome  with  the  current  network  state,  giving 
a  very  high  vote  if  the  action  appears  to  greatly  improve 
the situation, and a somewhat lower vote if the action
promises  only  a  small  improvement. 

FBI calculates its vote using its database of previous
examples. If the proposed action has greatly helped
similar situations in the past, it receives a very high vote,
otherwise a somewhat lower vote. An example consists
of descriptions of an initial network state, a set of actions,
and the network state that results from taking those actions
and running the network for an additional period of
simulated time. Two preferred subsets are maintained,
one of examples whose rankings are very high, and one
for examples whose rankings are high. Learned decision
trees  defined  over  these  subsets  map  an  initial  state  to  a 
set  of  actions;  the  advice  provided  is  either  a  very  high  set 
of  actions,  a  high  set,  or  none.  FBI’s  vote  depends  on 
success  ratios  of  similar  actions  in  the  database. 

NetMan bases its vote partly on past experience.
It also uses its domain theory, taking into account
the effect of controls. For example, an action
receives credit if it contains controls that move traffic
from overflowing trunks to non-overflowing ones.
NetMan  calculates  the  vote  for  a  proposal  by  combining 
the  vote  based  on  past  experience  (if  available)  with  the 
vote  based  on  domain  knowledge. 

5.2.  Critiquing  Suggestions  of  Others 

In  general,  an  agent  critiques  a  proposal  using  the  same 
mechanism  used  to  produce  a  vote.  Thus  Maclearn 
uses FLosim to simulate the effect of the proposed controls,
FBI uses a decision tree, computed over all examples,
mapping pairs of initial states and actions into
rankings  to  evaluate  the  expected  effect  of  a  proposed 
action,  and  NetMan  uses  a  domain  model  and  previous 
experience  to  estimate  the  effectiveness  of  the  proposal. 
NetMan also performs a series of “sanity checks” on
the proposal. For example, rerouting traffic to a trunk that
is  currently  non-operational  is  a  bad  mistake,  and  a 
proposal  containing  such  a  reroute  will  receive  a  poor 
score. 

In  some  cases  an  agent  may  indicate  that  it  has  no 
opinion  on  a  proposal.  This  can  happen  if  the  agent  has 
no knowledge or experience concerning a particular control
or set of controls. Such critiques are discarded.

5.3. Choosing between the Proposals

As indicated above, currently ILS chooses the action
with  the  highest  average  score.  Other  alternatives  in¬ 
clude: 

•  Choose  the  action  with  the  highest  raw  score;  in 
other words ignore the effects of critiquing.

•  Choose the action that is highly scored by more
than  half  of  the  agents. 

•  Choose  the  action  proposed  by  the  agent  that 
has proved most reliable in the past.

•  Weight  the  votes  of  an  agent  in  proportion  to  its 
past  performance. 

These  strategies  are  currently  being  evaluated. 
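Three of the alternatives listed above can be sketched as follows. The function names, the data layout, the threshold of 4 for a "high" score, and the reliability values are all hypothetical, and reliabilities are assumed to be positive.

```python
def highest_raw_score(proposals):
    """proposals: list of (agent, action, vote). Critiques are ignored."""
    return max(proposals, key=lambda p: p[2])[1]

def majority_high(votes_by_action, n_agents, high=4):
    """First action scored >= high by more than half of the agents, else None."""
    for action, votes in votes_by_action.items():
        if sum(1 for v in votes.values() if v >= high) > n_agents / 2:
            return action
    return None

def reliability_weighted(votes_by_action, reliability):
    """Action with the best average vote, weighting each agent by reliability."""
    def score(action):
        votes = votes_by_action[action]
        total = sum(reliability[agent] for agent in votes)
        return sum(reliability[agent] * v for agent, v in votes.items()) / total
    return max(votes_by_action, key=score)

# Hypothetical votes: {action: {agent: vote}}
votes = {
    "A": {"NetMan": 5, "FBI": 2, "Maclearn": 2},
    "B": {"NetMan": 3, "FBI": 4, "Maclearn": 4},
}
raw = highest_raw_score([("NetMan", "A", 5), ("FBI", "B", 4)])
majority = majority_high(votes, n_agents=3)
trusted = reliability_weighted(votes, {"NetMan": 0.9, "FBI": 0.3, "Maclearn": 0.3})
equal = reliability_weighted(votes, {"NetMan": 1.0, "FBI": 1.0, "Maclearn": 1.0})
```

The same set of votes selects different actions under different strategies, which is why the choice of strategy has to be evaluated empirically.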

The  last  two  suggestions  above  require  that  Tlc  keep 
track  of  the  performance  of  each  agent,  and  thus  itself 
learn  to  choose  among  the  various  proposals.  This  can  be 
done  with  varying  degrees  of  sophistication.  The  simplest 
method  is  to  increase  the  perceived  reliability  of  an  agent 
when  its  choice  is  selected  and  causes  improvement,  and 
to  decrease  the  perceived  reliability,  if  its  action  makes 
the  network  worse. 

One  problem  with  this  approach  is  that  Tlc  cannot 
know  whether  a  successful  action  was  the  best  of  all  of 
the  suggested  actions;  this  is  inherent  in  every  scheme 
that does not perform a total search. Thus an agent may
be rewarded when the untried suggestion of another agent
would have caused greater improvement. Another difficulty
with this method is that the agents are learning, so
that an agent that originally provided poor advice may
now produce effective suggestions. One way of correcting
this problem is to discount earlier performance, giving
more weight to recent experience.
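One simple way to discount earlier performance is an exponentially weighted average of outcomes. This is a sketch of one possible scheme, not the method used in Tlc; the function name, the starting value of 0.5, and the discount factor of 0.8 are assumptions.

```python
def update_reliability(reliability, improved, discount=0.8):
    """Blend the old reliability with the latest outcome.

    improved: True if the agent's chosen action improved the network.
    A smaller discount forgets old outcomes faster.
    """
    outcome = 1.0 if improved else 0.0
    return discount * reliability + (1.0 - discount) * outcome

def run(outcomes, start=0.5):
    r = start
    for improved in outcomes:
        r = update_reliability(r, improved)
    return r

# Same number of successes and failures, in a different order.
improving_agent = run([False, False, True, True, True])
declining_agent = run([True, True, True, False, False])
```

An agent whose recent advice has worked ends up with a higher reliability than one with the same overall record whose recent advice has failed.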

Nevertheless, empirically the simple technique appears
promising.  It  can  be  extended  to  the  critiquing  process  as 
well;  if  an  agent  gave  a  low  score  to  an  action  that  was  in 
fact  successful,  Tlc  can  decrease  its  perceived  reliability 
and  similarly  in  the  other  cases. 

All  of  the  techniques  described  above  result  in  Tlc 
selecting one of the proposed actions. A more sophisticated
approach would be to form new advice by combining
the proposals of individual agents. Various combination
techniques could be used. For example, a purely
syntactic  method  might  use  union  or  intersection  opera¬ 
tions  to  generate  the  new  advice.  Alternatively,  a  more 
knowledge-based  approach  might  detect  that  two  of  the 
proposals  treat  independent  separable  problems,  and  so 
could  be  usefully  combined. 
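The purely syntactic combination can be illustrated by treating each proposal as a set of controls. The control names below are hypothetical, and no check is made that the combined advice is sensible in the domain, which is exactly what the knowledge-based alternative would add.

```python
def combine_union(proposals):
    """All controls suggested by any agent."""
    return frozenset().union(*proposals)

def combine_intersection(proposals):
    """Only the controls that every agent suggested."""
    proposals = list(proposals)
    if not proposals:
        return frozenset()
    return frozenset(proposals[0]).intersection(*proposals[1:])

# Hypothetical proposals from two agents.
p1 = {"reroute T3->T5", "cancel T7"}
p2 = {"reroute T3->T5", "skip T9"}
everything = combine_union([p1, p2])
agreed = combine_intersection([p1, p2])
```

Union yields the most aggressive advice and intersection the most conservative; either may differ from every individual proposal.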

5.4.  Obtaining  Feedback 

One  interesting  feature  of  the  ILS  architecture  is  that 
some agents can learn even when another agent’s action is
the  one  that  is  chosen.  In  the  case  of  FBI,  the  mechanism 
is  particularly  simple;  the  whole  cycle  is  just  another  ex¬ 
ample,  consisting  of  a  before  state,  an  action  and  an  after 
state. NetMan is able to learn if the action chosen is
sufficiently  similar  (according  to  a  complex  metric)  to  an 
action  it  proposed. 

At  present,  Maclearn  does  not  make  use  of  the  feed¬ 
back. 


6. Inter-agent Cooperation

Section  5.2  described  one  level  of  cooperation  between 
the  agents:  the  ability  to  critique  the  actions  of  each  other. 
This  section  further  describes  our  ongoing  work  on  some 
of  the  ways  in  which  the  various  agents  interact  to  im¬ 
prove  the  performance  of  individual  agents.  In  the  current 
system,  only  the  first  mechanism  is  implemented;  the 
others  will  be  added  shortly. 

6.1.  Concept  Formation 

Ideally,  NetMan  would  be  able  to  distinguish  ac¬ 
curately  between  situations  in  which  an  action  will  be 
successful  and  those  in  which  it  fails  in  a  certain  way. 
However,  the  complexity  and  stochastic  nature  of  the 
domain  make  this  an  unrealistic  target.  Instead,  NetMan 
heuristically  differentiates  the  cases  by  calling  on  FBI. 

FBI  is  given  two  classes  of  examples;  in  one  class  are 
all  the  examples  where  the  failure  mode  occurred,  in  the 
other  are  the  cases  where  the  action  worked.  FBI  is  used 
to  produce  a  function  that  heuristically  classifies  network 
states  as  likely  to  suffer  from  that  failure  mode  or  not.  If 
NetMan  is  considering  performing  the  action,  it  asks  the 
inductive  learning  program  for  its  classification.  If  the 
current  state  is  classified  as  likely  to  fail,  and  the  failure 
mode  has  proved  sufficiently  severe,  NetMan  does  not 
propose  the  action.  In  effect,  the  classification  step  is 
added  as  a  precondition  to  the  treatment  rule  that  proposes 
the  action. 
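
The precondition described above can be sketched in a few lines. This is a hedged illustration only: `should_propose`, the severity scale and its threshold are hypothetical names and values, and `failure_classifier` merely stands in for whatever function FBI induces from the two classes of examples.

```python
def should_propose(state, failure_classifier, severity, severity_threshold=0.5):
    """Sketch of the added precondition on a treatment rule.

    failure_classifier: callable(state) -> bool, standing in for the
        heuristic function induced by FBI (True = likely to fail).
    severity: assumed score in [0, 1] of how bad the failure mode has proved.
    """
    likely_to_fail = failure_classifier(state)
    # Withhold the action only when failure is predicted AND the failure is severe.
    return not (likely_to_fail and severity >= severity_threshold)


def toy_classifier(state):
    # Toy stand-in for an induced classifier: predict failure on heavily loaded states.
    return state["load"] > 0.9
```

For instance, `should_propose({"load": 0.95}, toy_classifier, severity=0.8)` withholds the action, while the same state with a mild failure mode (`severity=0.1`) still lets the action be proposed.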

6.2.  Reducing  the  Search  Space 

As mentioned previously, Maclearn can be overwhelmed
by the combinatorial explosion on complex networks.
This problem could be alleviated by giving
Maclearn constrained search problems. In a future version
of ILS, the constraints would come from FBI or
NetMan detecting important features of the current network
state. The search could be constrained in various
ways:

•  Reduced  operator  sets:  Maclearn  is  told  to 
consider  only  certain  types  of  controls. 

• Constrained areas of application: Maclearn is
told  to  consider  placing  controls  on  only  the 
specified  network  elements. 

•  New  starting  state:  Maclearn  is  told  to  as¬ 
sume  that  certain  controls  must  be  present. 

6.3. Assessing Discovered Concepts

As described in Section 2.1, FBI can automatically
discover potentially interesting concepts that are combinations
of other existing functions. Often the concepts
discovered in this way are useful because they reflect
genuine features of the domain. In other cases, the
concepts discovered are artifacts caused by random patterns
in the example set.

A knowledge-based agent, such as NetMan, could be
used to heuristically classify a concept as useful or not
useful. When FBI proposes a new concept, it could ask
NetMan if there is any domain knowledge that combines
the functions contained in the concept. If NetMan did
have such domain knowledge, the concept would probably
be a genuine one. If not, the concept may be an
artifact, although perhaps NetMan is missing some
domain knowledge.

7.  Conclusions  and  Further  Work 

ILS is a framework for integrating several distributed
heterogeneous learning agents that cooperate to improve
problem-solving performance. The agents learn both
independently and cooperatively. Each agent has been
tested independently using telecommunications traffic
control as a domain, and each has demonstrated that it can
learn by interacting with that domain. At present ILS is
being extensively tested using the same domain. One
focus of the current research is the relative utility of
various TLC selection strategies. Simple strategies appear
to be effective.

7.1.  Further  Work 

There  are  several  directions  that  the  research  is  ex¬ 
pected  to  follow  in  the  near  future: 

• Detailed experiments will be performed to
evaluate the ILS framework. At a minimum, to
be successful, ILS must demonstrate:

    • Enhanced problem-solving performance:
      ILS as a whole must do better than
      each of its constituent parts.

    • Improved learning: The learning of each
      agent will be accelerated within ILS, and
      each agent's knowledge will be adjusted
      to be more effective.

• More learning agents will be added in order to
assess the completeness of the communication
protocols and to explore inter-agent cooperation.
The next to be added will be a pattern-based
learning program such as AQ11
[Michalski & Chilausky 80].

•  Other  forms  of  search  and  methods  for  con¬ 
straining  them  will  be  examined.  This  effort 
will  enhance  the  work  described  in  Sections  2.2 
and  6.2  on  best-first  search  and  on  how  agents 
can  provide  advice  to  constrain  such  searches. 

•  The  areas  of  inter-agent  cooperation  will  be  ex¬ 
panded. 

• ILS will also be tested in another, yet to be
determined, application domain.

Another possibility under active consideration is to
remove TLC by distributing its responsibilities amongst
the learning agents. This approach allows the agents to
behave more autonomously, and draws on the existing
work in Distributed Artificial Intelligence [Brandau &
Weihmayer 89].

The intent is ultimately to develop a
domain-independent ILS.
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ABSTRACT 

This paper presents an integrated
framework, IR1, for inducing
inference rules from examples.
Achieving this requires considering the
generality, inducibility, incrementality,
uncertainty and hierarchy of expert
rules. Our approach is a mutual
integration of empirical and analytical
techniques that avoids the shortcomings
each has when applied individually. It also
incorporates some new ideas and
algorithms to achieve these goals.
Preliminary experimental results are
encouraging, but more work is required.

1.  Introduction 

Quinlan's (1983, 1986) ID3 is an interesting
algorithm which induces a decision tree from
examples. Each example is described by several
attributes and belongs to a class. The task is to
derive a logical description of each class. The
significance of the task depends on how the
description is interpreted. For instance, if
each class represents a decision and each
attribute an aspect of the situation, the description
may be interpreted as a decision rule; if each class
corresponds to a kind of disease and each attribute to a
symptom, the description may be interpreted as a
diagnosis rule. In other words, solving the
problem could greatly contribute to widening the
knowledge acquisition bottleneck by inducing
rules automatically.

Many improvements have been made since
ID3. For instance, Schlimmer & Fisher (1986)
constructed ID4, an incremental ID3;
Utgoff (1988) built ID5, a more elaborate
incremental ID3; Cendrowska's (1987) PRISM
obtains a set of simple classification
rules instead of a decision tree; Quinlan (1987)
also simplifies a decision tree into a set of modular
production rules. However, the rules or trees
induced by these systems hardly capture the characteristics
of the expert rules used in making inferences. This
paper is devoted to solving this problem.
We begin by examining the characteristics
of human inference rules. A learning
framework is then presented that integrates
similarity-based and explanation-based approaches to
induce rules.

2.  Exemplary  Domain 

We give an exemplary problem excerpted
from Cendrowska (1987). An adult spectacle
wearer wants to purchase her first pair of contact
lenses. From the optician's point of view, this is a
three-category classification problem. His
decision will be one of:

@1:  the  patient  should  be  fitted  with  hard  con¬ 
tact  lenses, 

@2:  the  patient  should  be  fitted  with  soft  contact 
lenses, 

@3:  the  patient  should  not  be  fitted  with  contact 
lenses. 

In reaching his decision he must consider one or
more of four factors:

a: the age of the patient
1. young, 2. pre-presbyopic, 3. presbyopic

b: her spectacle prescription
1. myope, 2. hypermetrope

c: whether she is astigmatic
1. no, 2. yes

d: her tear production rate
1. reduced, 2. normal

Table 1 shows the optician's decision for
each combination of the four factors. However,
the optician just uses his rule-based experience
instead of carrying such a table around with him,
either on his person or in his head.


ID3's decision tree:

TABLE 1. Decision table for fitting contact lenses

With Table 1 as the training set, ID3
obtains a decision tree and PRISM produces
a set of rules (Fig. 1). The obvious improvement
of PRISM is that its rules are similar to the
ones the optician may use. Its fundamental
improvement lies in eliminating redundancy.
For example, if the patient is a presbyope with
high hypermetropia and astigmatism (i.e.,
a3 & b2 & c2), the optician would know immediately
that she is not suitable for contact lens wear.
PRISM can reach this conclusion by using rule 9.
ID3, however, needs to know her tear production
rate (the value of d) in order to make a decision. This
test normally costs considerable time and
money. It would be quite understandable if the
patient became upset or angry on finding that the test
was unnecessary. The consequence could be more
serious if the attribute involved surgery.


PRISM's rules:

1. c2 & d2 & b1 ==> @1
2. a1 & c2 & d2 ==> @1
3. c1 & d2 & b2 ==> @2
4. c1 & d2 & a1 ==> @2
5. c1 & d2 & a2 ==> @2
6. d1 ==> @3
7. a3 & b1 & c1 ==> @3
8. b2 & c2 & a2 ==> @3
9. b2 & c2 & a3 ==> @3

Fig.  1.  The  output  of  ID3  and  PRISM 
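
The point that PRISM's rules can decide without every attribute can be checked with a small sketch. The nine rules are those of Fig. 1; the dictionary encoding, the name `PRISM_RULES` and the function `fired_classes` are assumptions made here for illustration and are not part of PRISM itself.

```python
# PRISM's nine rules from Fig. 1, encoded as (conditions, class) pairs.
# Attribute letters and value numbers follow the factor list in Section 2.
PRISM_RULES = [
    ({"c": 2, "d": 2, "b": 1}, 1),
    ({"a": 1, "c": 2, "d": 2}, 1),
    ({"c": 1, "d": 2, "b": 2}, 2),
    ({"c": 1, "d": 2, "a": 1}, 2),
    ({"c": 1, "d": 2, "a": 2}, 2),
    ({"d": 1}, 3),
    ({"a": 3, "b": 1, "c": 1}, 3),
    ({"b": 2, "c": 2, "a": 2}, 3),
    ({"b": 2, "c": 2, "a": 3}, 3),
]


def fired_classes(known):
    """Classes concluded by rules whose conditions are all met by the known values."""
    return {
        cls
        for conditions, cls in PRISM_RULES
        if all(known.get(attr) == val for attr, val in conditions.items())
    }
```

Calling `fired_classes({"a": 3, "b": 2, "c": 2})` fires rule 9 alone and concludes class 3 although the tear production rate d is unknown, which is the case discussed in the text.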

3.  Characteristics  of  inference  rules 

In spite of this improvement, PRISM is in
essence a classification system and requires
that all the necessary attributes be provided prior to
decision making. We still find differences
between PRISM's rules and expert rules (or
inference rules). These include:

A. Hierarchy. Inference rules are linked
together to form an inference chain. An
intermediate conclusion may be a suggestion
to have a tear production test when the
patient is hypermetropic and astigmatic. In other
words, a human expert does not have all the
necessary information ahead of decision making.
He must reason about what should be done when
faced with incomplete information. This is
one important reason for a rule hierarchy.
Another reason is that there exist some
important intermediate concepts or decisions.
These also contribute to forming an inference
chain that leads to the final conclusion.

B. Theory Dependency. Rules may be derived
from a single example with the aid of domain
knowledge. Theory plays a central role in
justifying derived rules.

C. Incrementality. Human knowledge is
cumulative. When a new experience (an
example in our case) comes up, a human expert
can refine the old knowledge base. The
problem in our case is how to modify the generated
rule base to include arbitrary new
examples.

D. Generality. Extensive generalization is
done in deriving rules. One essential
generalization is the introduction of variables
in place of objects. This explains the predictive
power of human knowledge.

E. Uncertainty. Uncertainty is typical of human
knowledge. People frequently do not know for
certain whether an example belongs to a class. Instead,
they prefer to make assertions in a plausible
way.

4. An Integrated Approach: IR1

In order to include the above improvements,
we create a new approach, IR1, that integrates
similarity-based learning (SBL) and explanation-based
learning (EBL) methods.

4.1.  Integration  Issues 

Integrating SBL and EBL, or, as we prefer to say,
empirical and analytical learning, is considered
a feasible way to avoid the shortcomings each has
when applied individually. Many researchers
are working on different integration
models (Lebowitz, 1986; Danyluk, 1987;
Kedar-Cabelli, 1987; Pazzani, 1988). A striking
similarity shared by these models is that one
form of learning is invoked first and its results
are integrated by the other. This one-way
interaction is not the way humans acquire
knowledge and is not expected to be
efficient (Swaminathan, 1989). This shortcoming is
studied and tackled by Swaminathan (1989).
However, his model works on concept formation.
New points must be considered when trying to
derive inference rules usable in KBSs. Let us go
into the details of IR1 (see Fig. 2).

IR1 has two entries, which correspond to
two running versions: incremental and batch.
Initially, no rules exist but a large set of examples
does, and IR1 is invoked in batch to trigger the
empirical components. A set of inference rules is
derived and the system is ready to explain new
examples. Each time a new example comes in, IR1
invokes its analytical component EBL to
explain the example. When the explanation result is
inconsistent, incorrect or incomplete, the empirical
components are invoked incrementally to
update the rules generated before. The update is
guided by the explanation results,
which determine which rules are to be modified.
This procedure is repeated until a satisfactory
explanation is achieved or the new example is
found to contradict an old one. As a result, the
updated rules are able to cover the new example as
well as the old ones. We see that
incrementality is achieved by adopting EBL
techniques, and that a mutual integration is
achieved by an iterative interaction between the
empirical and analytical components.

BATCH:
    Generating classification rules with CLASS;
    Establishing a primary rule hierarchy with EPH;
    Generalizing rules with GR;
    Refining hierarchy with ERH;

INCREMENTAL:
    Explaining new examples with analytical component EBL;
    Triggering empirical components to update rules
    in terms of explanation results as below:
    switch (result.type)
    {
    case perfect-explanation:
        outputting successful explanation information;
        making explanation-based generalization with EBG;
        break;
    case incorrect-explanation:
        modifying rules and hierarchy with
        Incremental CLASS and EPH;
        regeneralizing rules with Incremental GR;
        refining hierarchy with Incremental ERH;
        break;
    case inconsistent-explanation:
        same as above but with different arguments;
    case incomplete-explanation:
        regeneralizing rules with Incremental GR;
        refining hierarchy with Incremental ERH;
        break;
    case counter-example:
        outputting the old example inconsistent with
        the new one;
        break;
    }
    repeating above explanation and modification
    until a successful explanation or a
    counter-example;

Fig. 2. IR1's flowchart in C-like language

4.2.  Representation  Issues 

An example in IR1 is a conjunctive
combination of assertions about objects. The assertion
is the most primitive semantic unit and can be
generally represented as:

(OBJECT, ATTRIBUTE, VALUE, CERTAINTY),

where OBJECT is a specific instance of a
concept. For example, "John is surely 24 years
old" is represented as:

(John, Age, 24, surely),

where John is a specific instance of man. This
representation makes it possible to form
variables from a set of objects. However, this basic
representation is not enough to deal with
uncertainty. So, we represent uncertainty
explicitly and take it into account appropriately
in forming a rule. Uncertainty is
basically of two kinds: fact and rule. We deal
with factual uncertainty by creating new
valuations for attributes. Rule uncertainty is
semantically identical to the uncertainty of the rule's
conclusions. The uncertainty of attributes and
of conclusions is unified into assertion
uncertainty. Human experts tend to deal with
uncertainty through qualitative symbols,
specifically linguistic terms; that is, their
description of assertion uncertainty is discrete. So,
we can represent uncertainty by segmenting the
quantitative description into a qualitative one. For
example, if the decision is about the weather, the
basic classifications are clear, cloudy, rain. We
have:

the weather is

1. clear 2. cloudy 3. rain.

When we are not sure it will rain, a number,
called credit, is associated with rain. We first
quantize it into maybe, mostly, certainly. The
new values are added to replace "rain", i.e.,

3a. maybe rain
3b. mostly rain
3c. certainly rain.

In this way, the above basic representation is
transformed into a 3-tuple:

(OBJECT, ATTRIBUTE, EXTENDED VALUE),

which is the right way of representing an
assertion for examples in IR1.
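
The move from the 4-tuple to the 3-tuple can be illustrated as follows. This is a sketch under stated assumptions: the type names, the numeric thresholds in `credit_to_word` and the example object `tomorrow` are hypothetical, since the paper does not specify how credit is segmented.

```python
from typing import NamedTuple


class Assertion(NamedTuple):
    obj: str
    attribute: str
    value: str
    certainty: str


class ExtendedAssertion(NamedTuple):
    obj: str
    attribute: str
    extended_value: str


def credit_to_word(credit):
    """Segment a numeric credit into a qualitative word (thresholds are assumed)."""
    if credit >= 0.9:
        return "certainly"
    if credit >= 0.6:
        return "mostly"
    return "maybe"


def extend(assertion):
    """Fold the certainty into the value, giving the 3-tuple form."""
    return ExtendedAssertion(
        assertion.obj,
        assertion.attribute,
        f"{assertion.certainty} {assertion.value}",
    )
```

For example, `extend(Assertion("tomorrow", "weather", "rain", credit_to_word(0.7)))` yields the extended value `"mostly rain"`, corresponding to value 3b above.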

At the end of rule formation, the certainty of an
assertion is separated out again. That is, the
certainty of a rule is explicitly represented in
IR1, which is consistent with human cognition.
In fact, at the generalization step of IR1 (see Section
4.3.3), an effort is made to remove certainty
(also called credit) from rules. A numerical
rule is generated to compute the credit of a
conclusion from the credits of the conditions. So, a rule
is a logical combination of assertions, which is
represented in predicate logic.

IR1 also represents a concept framework
hierarchically. This describes the
generalization and specialization relations between
concepts. In addition, any object must belong to a
specific concept in order to form a predicate, as in
Step 1c of Section 4.3.3.

In the following sections of the paper, examples
are still written as attribute-value pairs to
simplify writing.

4.3.  Empirical  Components 

A group of new algorithms has been created in
IR1 to perform empirical tasks. They distinguish IR1
from other empirical as well as integrated
learning systems. Two features are
shared by IR1's empirical components. First,
they are designed to operate both incrementally
and in batch, depending on the arguments supplied. In
the case of incremental running, the arguments
supplied must determine the scope of the rules
to be modified. Second, any comparison or
computation must take credit into account. The examples
belonging to the same class must have the same
credit. Now, let us outline them briefly.

4.3.1.  Generating  Classification  Rules  with 
CLASS 

The procedure CLASS is similar to PRISM.
The difference lies in how they handle two or more
attribute-value pairs of equal priority.
PRISM constructs its rules by selecting
attribute-value pairs one by one. The selection
sequence determines the simplicity of the rules. The
selection criterion is a priority computed for
each attribute-value pair (Cendrowska, 1987).
Two or more pairs sometimes have equal
priority. In that case IR1 adopts a thorough search,
which guarantees the simplicity of the rules.

4.3.2. Establishing a Primary Hierarchy With
EPH

We associate a number, called cost, with
each attribute that is difficult
or costly to evaluate; such an attribute is called an
undetermined attribute. The number, ranging
from 1 to 100, indicates the availability of the
attribute. We use a simple example to illustrate
the procedure EPH. Assume the rule:

a2 & b3 & d2 & e4 ==> @1

If the costs of d and e are 20 and 30 respectively, we
decompose the rule into:

a2 & b3 ==> TEST(d)
TEST(d) & a2 & b3 & d2 ==> TEST(e)
TEST(e) & a2 & b3 & d2 & e4 ==> @1,

where TEST means the action of carrying out
a test to evaluate the corresponding
attribute. In this way we establish a hierarchy:
testing d, testing e and making the decision. Assuming d in
Table 1 is an undetermined attribute, EPH may
generate a few additional rules such as:

c2 & b1 ==> TEST(d)
a1 & c2 ==> TEST(d)
c1 & b2 ==> TEST(d)
c1 & a1 ==> TEST(d)
c1 & a2 ==> TEST(d).
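
The decomposition above can be sketched as a short routine. This is not the authors' EPH code: the function name `decompose`, the string encoding of selectors (attribute letter followed by value number) and the choice to order tests by ascending cost are assumptions, the last one merely being consistent with the example, where d (cost 20) is tested before e (cost 30).

```python
def decompose(conditions, conclusion, cost):
    """Split one rule into a chain of TEST rules, cheapest undetermined attribute first.

    conditions: selectors such as "a2" (attribute letter + value number)
    cost: dict mapping each undetermined attribute letter to its cost
    Returns a list of (left-hand side, right-hand side) pairs.
    """
    known = [s for s in conditions if s[0] not in cost]
    costly = sorted((s for s in conditions if s[0] in cost), key=lambda s: cost[s[0]])

    rules = []
    previous_test = None
    for selector in costly:
        test = f"TEST({selector[0]})"
        lhs = ([previous_test] if previous_test else []) + known
        rules.append((lhs, test))
        previous_test = test
        known = known + [selector]   # the tested value is available from now on

    lhs = ([previous_test] if previous_test else []) + known
    rules.append((lhs, conclusion))
    return rules
```

Applied to `["a2", "b3", "d2", "e4"]` with costs `{"d": 20, "e": 30}`, the routine reproduces the three rules shown in the text.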

4.3.3.  Generalizing  Rules  With  GR 

Generalization was exhaustively investigated
in AQ11 by Michalski and Stepp (1983). We
borrow their approach with some modifications to
fit our case. For instance, seeds are
predetermined, so it is unnecessary to select
seeds. The resulting procedure, GR, is
characterized as below:

Step 1. Stars are constructed.

In our case, a star G(@i) is defined as the set of
all maximally general complexes (a complex being a logical
description of a class) covering all the examples of
class @i and not covering other examples in
the training set. The construction begins with
the rules obtained through the above steps. The
rules are generalized as much as possible by
GEN operators, stopping before they cover examples
of other classes. The GEN operators come partly from
AQ11 and are partly created by the authors.
The resulting stars need not be reduced
as in AQ11, so that they cover as
much as possible given an incomplete training
set.

1a. To linear selectors, the "closing the interval"
rule is first applied. For instance, 1, 2, 3, 7, 8
is turned into 1..3, 7..8 or 1..8 according to a
threshold on how much generalization can be
done. Specifically for integer selectors, some
"conceptual abstraction" rules are applied.
For instance, 2, 4, 8, 16 is turned into 2^n
for an integer n, and 4, 7, 10, 13 is turned into
3n+1.

1b. To structured selectors, the "climbing the
generalization hierarchy" rule is applied. For
instance, b1, b2, b34, b21 is turned into b,
where b1, b2, b34 and b21 are specific
cases of the structure b.

1c. To any objects, the "predicate formation"
rule is applied. A set of objects belonging to
the same concept and sharing the same
preconditions is merged into a variable
appearing in a predicate. The predicate usually
corresponds to an attribute.

1d. To any selectors, the "augmenting selectors"
rule is applied: the enumerative forms of
several selectors are combined into a relational
form among them. For example, a = 1, 2, 5, 7,
12 and b = 3, 4, 7, 9, 14 is turned into a = b-2.
The rule can be generally stated as searching
for numerical regularity in the data of the
variables. This problem was exhaustively
investigated by Wu (1988).

1e. After steps 1a, 1b and 1c, the "dropping the
condition" rule is applied to all selectors: a
selector is removed if this does not exceed the
generalization threshold.

1f. To all classification rules, the "merging into a
general common form" rule is applied. This is
done on the basis of a belief that different
classification rules should have symmetrical
forms. If two rules are not symmetrical,
the more specific one should be changed into the
more general form in order to make them
symmetrical. This substep is of some value for
refining the hierarchy.

1g. To all rules, credit computation rules are
generated. Explicitly representing the credits of
attribute-value pairs in rules would lead to overly
specific rules. We try to eliminate credits
from rules and generate special rules to
compute the credits of conclusions from the credits of
conditions. This is just like numerical
discovery from a set of discrete data. Here, we
use a simplified Reduction (Wu, 1988) to
accomplish the task.

Step 2. Stars are optimized.

An optimized classification is built by selecting
and modifying complexes from the stars to make
them mutually disjoint, using procedure NID as in
AQ11.

Step 3. A Primary Termination Criterion PTC is
evaluated.

This criterion is based on AQ11 with
some modification of the classification
quality evaluation function LEF. The
biggest difference lies in IR1's neglect of
sparseness, a measure of the
difference between the input training set
and the examples that the rules cover.
This is because we want the induced rules
to be as instructive as possible in dealing with
new cases.
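
Three of the GEN operators from Step 1 (closing the interval, the linear conceptual abstraction in 1a, and the relational form in 1d) can be sketched as follows. These are simplified stand-ins rather than the GR operators themselves: the function names and the `max_gap` parameter, which plays the role of the generalization threshold, are assumptions.

```python
def close_intervals(values, max_gap=1):
    """'Closing the interval': merge integers into ranges, bridging gaps up to max_gap."""
    values = sorted(set(values))
    if not values:
        return []
    ranges = [[values[0], values[0]]]
    for v in values[1:]:
        if v - ranges[-1][1] <= max_gap:
            ranges[-1][1] = v          # extend the current range
        else:
            ranges.append([v, v])      # start a new range
    return [tuple(r) for r in ranges]


def linear_form(values):
    """Return (step, offset) if the k-th value equals step*k + offset for k = 1, 2, ...; else None."""
    if len(values) < 2:
        return None
    step = values[1] - values[0]
    if any(b - a != step for a, b in zip(values, values[1:])):
        return None
    return step, values[0] - step


def constant_offset(a_values, b_values):
    """Return k if a = b + k holds for every pair of corresponding values, else None."""
    if len(a_values) != len(b_values):
        return None
    diffs = {a - b for a, b in zip(a_values, b_values)}
    return diffs.pop() if len(diffs) == 1 else None
```

With the data from the text, `close_intervals([1, 2, 3, 7, 8])` gives 1..3 and 7..8, a larger `max_gap` of 4 gives 1..8, `linear_form([4, 7, 10, 13])` gives step 3 and offset 1 (that is, 3n+1), and `constant_offset([1, 2, 5, 7, 12], [3, 4, 7, 9, 14])` gives -2 (that is, a = b-2).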

4.3.4.  Refining  Hierarchy  With  ERH 

Refining the primary rule hierarchy
obtained in Step 3.3 is accomplished by abstracting
similar selectors in all rules. A combination of
common selectors is grouped together to form
an intermediate node in the hierarchy. This node
usually corresponds to a concept in the
domain. Thus, we encourage domain experts to
point out the important intermediate concepts to
help form a refined hierarchy. In addition, IR1
supports automated formation of the refined
hierarchy. Let us follow an example. Suppose the
initial subset of the rules is:

a1 & b1 & c1 & d1 ==> @1
a3 & b2 & c1 & d1 ==> @1
a1 & b1 & c3 ==> @2
a3 & b2 & c3 ==> @2
a4 & c2 & d3 ==> @2.

Refining is done in a four-step procedure
ERH (Establishing Refined Hierarchy):

Step 1. Merge rules into a disjunctive form. The
above subset is combined into:

(a1 & b1 ∨ a3 & b2) & c1 & d1 ==> @1
(a1 & b1 ∨ a3 & b2) & c3 ==> @2
a4 & c2 & d3 ==> @2.

Step 2. Replace the common sub-expression
with a newly introduced concept. In this example,
the new concept I is defined as a boolean
combination of a and b. That is:

a1 & b1 ∨ a3 & b2 ==> I1
a4 ==> I2
I1 & c1 & d1 ==> @1
I1 & c3 ==> @2
I2 & c2 & d3 ==> @2.

Step 3. Split the combined rules. The purpose is
to simplify the rules into:

a1 & b1 ==> I1
a3 & b2 ==> I1
a4 ==> I2
I1 & c1 & d1 ==> @1
I1 & c3 ==> @2
I2 & c2 & d3 ==> @2.

Step 4. The Hierarchy Termination Criterion
HTC is evaluated. If the hierarchy quality is not
satisfactory, go back to Step 2 to consider other
common sub-expressions until the quality
no longer improves. The quality, computed
through HTC, is determined by four factors:

4a. Simplicity of a single rule. A long rule
(involving many selectors) is usually considered
a poor one and needs decomposition by
establishing intermediate concepts (nodes). On
the other hand, rules that are too short are not
welcome either.

4b. The number of possible valuations of the
introduced concept. Fewer valuations
make the rules for classifying the new
concept relatively simple, and are thus of great
significance.

4c. The number of levels of the hierarchy.
New concepts are encouraged to form
other new concepts. The number must also be
proportional to the number of rules. This
is because we want a tree whose breadth is
comparable to its depth. In graphical terms,
the tree should be neither too "wide" nor too
"deep".

4d. The total number of rules. Fewer rules
are strongly encouraged.
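
Steps 1 to 3 of ERH can be sketched for the worked example as follows. This is an illustrative reconstruction, not the ERH procedure itself: the function names, the string encoding of selectors and the grouping criterion (selector combinations that occur in exactly the same contexts share one concept value) are assumptions chosen because they reproduce the example, and the attributes to abstract are supplied by hand rather than searched for.

```python
from collections import defaultdict


def split(selectors, abstract_attrs):
    """Separate a rule's selectors into the part to abstract and the remainder."""
    part = tuple(s for s in selectors if s[0] in abstract_attrs)
    rest = tuple(s for s in selectors if s[0] not in abstract_attrs)
    return part, rest


def refine(rules, abstract_attrs, concept="I"):
    """Introduce an intermediate concept over `abstract_attrs`.

    rules: list of (selectors, class) pairs, e.g. (["a1", "b1", "c3"], "@2")
    Returns (definitions, rewritten) in the split form of Step 3.
    """
    # Record, for each combination of abstracted selectors, the contexts it occurs in.
    contexts = defaultdict(set)
    for selectors, cls in rules:
        part, rest = split(selectors, abstract_attrs)
        contexts[part].add((rest, cls))

    # Combinations with identical contexts receive the same concept value.
    names, value_of = {}, {}
    for part, ctx in contexts.items():
        key = frozenset(ctx)
        names.setdefault(key, f"{concept}{len(names) + 1}")
        value_of[part] = names[key]

    definitions = [(list(part), value_of[part]) for part in contexts]
    rewritten = []
    for selectors, cls in rules:
        part, rest = split(selectors, abstract_attrs)
        new_rule = ([value_of[part]] + list(rest), cls)
        if new_rule not in rewritten:      # merged rules collapse to one
            rewritten.append(new_rule)
    return definitions, rewritten
```

Run on the five initial rules with `abstract_attrs={"a", "b"}`, the sketch yields the three defining rules for I1 and I2 and the three rewritten rules listed under Step 3. Evaluating HTC and choosing which sub-expression to abstract are left out.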

4.4.  Analytical  Issues 

Let us go back to Fig. 2. The empirical
components of IR1 first generate a set of inference rules,
which also act as the domain theory for
explanation-based learning. When a new
example comes in, the rules involving variables are
invoked to explain the example. The explanation is
also viewed as problem solving, and the explanation
result is the solution to the example.

According to the explanation result,
different strategies are adopted. The analytical
component, i.e., explanation-based generalization,
is invoked only when there is a perfect
explanation (or a workable solution to the
example problem). New domain rules are
generated by generalizing the explanation structure.
Mitchell et al. (1986) presented a standard
explanation-based generalization technique: modified
goal regression. We first implemented their
model without much modification and named it
MEBG (Mitchell's EBG).

However, as DeJong and Mooney (1986)
pointed out, Mitchell's EBG is just one part of
what explanation-based learning means. They
specifically emphasize structural generalization.
In fact DeJong (1988) has a more systematic view
of what kinds of generalization can be done in
EBL. Structural generalization involves changing
the internal structure of the explanation. This
action is more open (not well guided or constrained)
and its result is of greater significance. So, we
also implemented a second analytical
component, DMSG (DeJong and Mooney's Structural
Generalization).

Our primary working domain is
hepatitis diagnosis. Hepatitis comprises Type
A and Type B. We originally hoped that rules for
the two types could be learned from each other. That
is, DMSG is, in some sense, a form of analogical
learning. In fact, Ellman (1989) has argued that EBL
can be viewed as analogical learning. This
emphasis makes its learning less blind and relatively
easy. In fact, the current version of IR1
implements Carbonell's (1983) transformational
analogy. Of course, we are also ready to add more
complicated explanation-based generalizations,
such as derivational analogy (Carbonell, 1986),
number generalization and temporal
generalization (DeJong, 1988).


5.  Experiments 

To have a flexible architecture, each component of
IR1 was implemented as an independent module that
can be run in batch mode (on all rules and classes)
or incrementally (on selected rules and classes).
Components can be arbitrarily combined to
accomplish the designed task. Two


incremental IR1. This flexible architecture also
greatly facilitates debugging, because each module
stores its output in a file that reveals
intermediate results to developers and users.

Our first experiment is the problem stated above.
Given a training set as in Table 1, IR1 generated
the desired results described in the last section.
When given a subset of the training examples,
incremental IR1 also generated the desired results
as the other examples were input. An interesting
phenomenon is that the rules can be derived without
inputting examples No. 17 and 18. IR1's GR module
succeeded in updating its rules to include Rule
No. 7 as in Fig. 1. No variables were generated for
the rules, and analytical learning was not verified
in this experiment.

The second experiment was intended to verify IR1 in
a complex real-world context. We chose hepatitis
diagnosis as a working domain, partly because its
epidemics sometimes lead to disaster and partly
because it typically involves intermediate concepts
and the testing of unknown attributes. An
appropriate rule hierarchy, sharing many
commonalities with doctors' diagnosis rules, is
generated from a set of medical cases (exemplary
patients with diagnoses) and refined from
accumulated new cases. More interesting results
appeared in its analytical learning components.
After MEBG was applied, it was ironically found
that the new rules generated by MEBG are just ones
putting the intermediate conclusions together. In
other words, the new rules have no hierarchy for
the final diagnosis. That is, MEBG just goes back
to PRISM's classification rules, with variables
added. In our context, this is not welcome.
Fortunately, we found that DMSG is quite useful. It
successfully generated a few rules for Type A
hepatitis diagnosis from explaining Type B
hepatitis diagnosis rules.

A point should be made about this experiment. We
originally aimed to use the empirical components to
set up the domain theory, in order to escape EBL's
strong dependence on a perfect domain theory. We
had hoped to input only a set of training
instances, probably with a conceptual framework.
However, the domain theory generated by IR1's
empirical components is not enough to justify the
new rules obtained by making an analogy with the
old ones. The justification needs support from
medical theory. So, we also embedded some basic
theoretical medical knowledge into IR1 to have an
explanation-based analogy.


6.  Concluding  Remarks 
work of inducing expert rules from a large set of
examples and refining rules from accumulated
examples. Its greatest feature is a mutual
integration of empirical and analytical learning
techniques. Its learning behavior is very similar
to the way a human being acquires knowledge.
Facts (examples) are collected and analyzed into a
theory (rules), which is then updated or verified
by new facts gradually (incrementally). Its output
(rules) is also similar to expert rules in a number
of dimensions. Last, a new group of learning
algorithms is presented to accomplish these goals.
All of these efforts bring it close to automated
rule acquisition for KBSs.

However, when stepping toward a practical rule
learning system, we inevitably find some obstacles.
The most serious one is that empirical as well as
analytical learning, especially generalization,
needs theoretical guidance. This theory is a theory
about rule generation, or more precisely a
meta-theory. In the incremental case, IR1's
empirical components are directed by its
explanation results. However, this is just an
indication of where the refining should be made,
not of how the refining is done. The same problem
also arises for its useful analytical component
DMSG, which also needs to be directed on how
structural generalization is done. Currently, DMSG
only conducts a simple structural mapping. More
complex generalizations are still far from a
practical approach. For man, the theory corresponds
to his general knowledge, or more precisely common
sense knowledge or world knowledge. To our dismay,
incorporating common sense into machines has
puzzled, and will continue to puzzle, AI
researchers.

Even if, one day, we fortunately find some ways of
common sense reasoning, we still cannot escape a
dilemma: where and how the common sense knowledge
comes from. We use an integration of SBL and EBL in
order to escape EBL's dependence on an existing
domain theory, but we ironically fall into a trap
where an existing common sense theory is required.

More practically, we feel the need for more
experiments to verify this complicated framework.
We are planning to use data that have been used by
other learning systems to allow a direct
comparison. We are also thinking of extending IR1's
explanation-based generalization to generate more
significant rules. To summarize our approach in a
word, much has been done and more is left to the
future.
Abstract

Natural  language  interfaces  have  fallen  short  of 
their  goal  of  providing  full  freedom  of  expression 
in  human-computer  interactions  largely  because 
significant  increases  in  the  coverage  of  the 
grammar  are  usually  accompanied  by  intolerable 
decreases  in  system  performance.  We  demonstrate 
a  method  for  overcoming  this  barrier  called 
adaptive  parsing,  a  technique  in  which  the  system 
accommodates  the  user  by  dynamically  growing  its 
grammar  to  acquire  her  preferred  forms  of 
expression  and  recognize  them  directly  in  the 
future.  The  specific  learning  method  in  an 
implemented  adaptive  parser  is  discussed  in  detail 
and  shown  to  be  adequate  to  acquire  eight 
idiosyncratic  grammars  with  good  coverage  and 
good  performance. 

1.  Introduction 

The  philosophy  behind  the  design  of  natural  language 
interfaces  is  to  permit  the  user  the  full  power  and  ease  of 
her  usual  forms  of  expression  to  accomplish  a  task.  A 
pragmatic  consideration  at  odds  with  this  philosophy  is 
that  we  can  neither  write  down  a  full  grammar  for  English 
nor  anticipate  all  ungrammatical  or  idiomatic  utterances 
that  may  occur  in  spontaneously  generated  input.  As  a 
result,  the  fundamental  decision  in  most  interface  designs 
is  the  choice  of  a  restricted  subset  of  English  to  constitute 
the  recognizable  grammar. 

In choosing a sublanguage, an interface designer is
faced with a dilemma. A small grammar with minimal
ambiguity has the advantages of fast processing and single
interpretations for most accepted utterances, but the
disadvantages of poor coverage and brittleness (accepting,
for example, “Schedule a meeting at 5 pm” but not “Add
a 5 pm meeting to my schedule” or even “Schedule
meeting 5 pm”). As the size of the sublanguage grows,
coverage is increased but processing time and the number
of interpretations produced for each utterance tend to grow
as well. Further, these negative factors increase without
any guarantee that the extensions made to the grammar
conform to the needs of the particular user who must
endure the performance degradations.

Solving  this  dilemma  requires  that  we  be  able  to  choose 
the  right  sublanguage  for  each  user.  Since  such  a  choice 
cannot  be  made  a  priori,  we  must  rely  on  the  interface  to 
accommodate  the  user’s  idiolect  automatically.  In  the  next 
section  we  introduce  a  model  of  language  learning  able  to 
acquire  an  idiosyncratic  grammar  through  repeated 
experience  with  a  user.  In  Sections  3  and  4  we  discuss  a 
particular  implementation  of  the  model  called  CHAMP.  In 
Section  5  we  demonstrate  that  the  system’s  recognition 
capability and response time asymptotically approach
near-optimal  performance  given  any  stable  inherent 
ambiguity  in  the  user’s  grammar. 

2.  The  Model 

The process of adaptive parsing is presented in Figure
1. Part (a) shows four user utterances (u1-u4), the first of
which lies within the language recognized by the current
kernel grammar (K) and the remainder of which do not.
To say that u1 is grammatical, or non-deviant with respect
to K, means that there is a set of grammatical forms in K
that map the utterance to the appropriate meaning structure
directly. Analogously, u2 through u4 are ungrammatical,
or deviant with respect to K, because there are no such
sets of forms. Thus, a static interface with restricted
sublanguage L(K) will accept u1 but reject u2, u3, and u4.

Part (b) demonstrates the search for a meaning structure
for deviant input. When an utterance lies outside of K, we
extend failed parse paths by applying a general recovery
action to a deviation with respect to a grammatical form.
There are four types of recovery action corresponding to
the four types of deviation for which they compensate:
insertion, deletion, substitution, and transposition. Note
that the set is complete: every utterance can be mapped
into a set of grammatical forms by zero or more
applications of these actions.

As  an  example  corresponding  to  Figure  1(b),  consider 
u2  =  “Arrange  a  meeting  5  pm”  and  a  grammar  K  in 
which  postnominal  cases  must  be  marked  by  valid 
prepositions  and  “arrange”  is  unknown.  A  parse  using  K 
alone  (Deviation-level  0,  or  simply,  Level  0)  will  reject 
[Figure 1 panels: (a) utterances partitioned by membership in L(K); (b) utterances partitioned by deviance with respect to K (Deviation-levels 0, 1, 2, and >2); (c) utterances partitioned by deviance with respect to K’. Legend: L(K) = language recognized by kernel grammar K; L(K’) = language recognized by adapted kernel K’; u1-u4 = user utterances.]

Figure 1: Adaptive Parsing. In response to u2, an utterance outside of L(K), the search for a meaning is extended
beyond L(K) in a least-deviant-first manner. Understanding u2 results in adding new forms to K, creating the
adapted kernel K’. Since the new forms recognize generalizations of the deviations in u2, other utterances
in the user’s language space may be less deviant with respect to K’ than they were with respect to K.


u2.  By  extending  the  parse  to  allow  one  recovery  action 
we  may  either  interpret  “arrange”  as  a  substitute  for  a 
recognizable  verb,  or  account  for  the  deletion  of  the 
postnominal marker, but not both on the same path. Thus,
the  parse  will  fail  at  Level  1  as  well.  Allowing  two 
recovery  actions  along  a  path,  however,  we  find  a 
mapping  from  the  forms  in  K  to  a  meaning  structure  for 
u2. 

As  the  example  indicates,  we  extend  the  search  in  a 
least-deviant-first  manner,  exploring  grammatical 
interpretations first. Then, if no meaning is found, paths
containing  a  single  recovery  action  are  considered,  then 
two  recovery  actions,  and  so  on.  The  use  of  a  general, 
composable  method  of  recovery  applicable  at  any  point  in 
the  input  distinguishes  error  recovery  in  our  model  from 
previous  relaxation  techniques  that  restricted  recovery  to 
specific transformations on specific constituents in the
grammar  [2]  [5]  [9]  [11]. 
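
The least-deviant-first search can be sketched on a toy scale. The following is a minimal illustration and not the CHAMP implementation: it assumes a flat kernel grammar in which each form is simply a sequence of word classes, and the lexicon, the names `classify`, `match`, and `least_deviant_first`, and the class name `number` are all hypothetical. It tries Level 0 first, then allows one recovery action, then two, and returns the first level at which the utterance maps onto a form together with the recovery annotations.

```python
# Toy lexicon and kernel grammar; every entry is illustrative, not CHAMP's.
LEXICON = {
    "schedule": "addword",
    "a": "indefmarker",
    "meeting": "meetingword",
    "at": "hourmarker",
    "5": "number",
    "pm": "nightword",
}
KERNEL = [
    ("addword", "indefmarker", "meetingword", "hourmarker", "number", "nightword"),
]


def classify(utterance):
    """Map each token to its word class, or to 'unknown'."""
    return tuple(LEXICON.get(tok, "unknown") for tok in utterance.lower().split())


def match(classes, form, budget):
    """Return a list of at most `budget` recovery actions that maps
    `classes` onto `form`, or None if no such list exists."""
    if not classes and not form:
        return []
    # No deviation: the next token fills the next step of the form.
    if classes and form and classes[0] == form[0]:
        rest = match(classes[1:], form[1:], budget)
        if rest is not None:
            return rest
    if budget == 0:
        return None
    # Substitution: a token of the wrong class stands in for a step.
    if classes and form and classes[0] != form[0]:
        rest = match(classes[1:], form[1:], budget - 1)
        if rest is not None:
            return [("substitution", form[0])] + rest
    # Deletion: a step of the form is missing from the utterance.
    if form:
        rest = match(classes, form[1:], budget - 1)
        if rest is not None:
            return [("deletion", form[0])] + rest
    # Insertion: the utterance contains an extra token.
    if classes:
        rest = match(classes[1:], form, budget - 1)
        if rest is not None:
            return [("insertion", classes[0])] + rest
    # Transposition: two adjacent constituents appear in swapped order.
    if len(classes) > 1 and len(form) > 1 and classes[:2] == (form[1], form[0]):
        rest = match(classes[2:], form[2:], budget - 1)
        if rest is not None:
            return [("transposition", form[0], form[1])] + rest
    return None


def least_deviant_first(utterance, forms=KERNEL, max_level=2):
    """Search Level 0, then Level 1, ... up to max_level deviations."""
    classes = classify(utterance)
    for level in range(max_level + 1):
        for form in forms:
            actions = match(classes, form, level)
            if actions is not None:
                return level, actions
    return None  # rejected: the user would have to rephrase


print(least_deviant_first("Arrange a meeting 5 pm"))
```

Under these assumptions “Arrange a meeting 5 pm” is rejected at Levels 0 and 1 and accepted at Level 2 with one substitution and one deletion, mirroring the example in the text. The sketch omits the semantic and pragmatic constraints that a real parser would apply.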

Another  way  in  which  our  model  differs  from  previous 
techniques  is  that  deviations  do  not  remain  deviations. 
Figure  1(c)  represents  the  adaptation  process.  During 
adaptation  the  current  kernel  is  augmented  with  new 
grammatical  components  that  capture  the  general  form  of 
the  deviations  described  by  a  meaning  structure  and  its 
recovery  actions.  The  ability  to  learn  more  than  one  new 
grammar  component  from  an  utterance,  as  well  as  the 
ability  to  learn  both  lexical  definitions  and  syntactic 
forms,  distinguishes  our  method  of  adaptation  from 
previous  methods  that  could  learn  only  a  single 
component  and,  in  general,  only  a  single  type  of  linguistic 
knowledge  [1]  [3]  [4]  [8]  [10]  [12]. 

If  we  arbitrarily  choose  a  sublanguage  for  a  non- 
adaptive  interface  there  is  no  guarantee  that  the  trade-off 


between  coverage  and  performance  is  to  any  particular 
individual’s  advantage— additional  forms  may  contribute 
to  ambiguity  in  the  search  space  without  corresponding  to 
the  user’s  preferred  forms  of  expression.  In  an  adaptive 
environment  the  trade-off  still  exists;  adaptation  may 
bring ambiguity into the current kernel. But if the
individual  tends  to  rely  on  previously  accepted  forms  of 
expression  such  increases  may  still  be  advantageous.  In 
prior  work  we  demonstrated  that  although  users  differ 
significantly  from  each  other  in  their  preferred  forms  of 
expression,  with  frequent  use  they  show  regular, 
self-bounded linguistic behavior: a tendency to rely on
those forms that have worked in the past [6] [7]. This
behavioral  regularity  means  that  once  accepted,  an 
idiosyncratic  form  is  likely  to  be  reused.  The  more  often  it 
is  reused,  the  more  advantageous  adaptation  becomes;  by 
modifying  the  grammar  to  recognize  a  deviant  form 
directly,  subsequent  encounters  with  that  deviation  will 
not  require  error  recovery.  In  addition,  future  sentences 
that  share  the  learned  deviation  will  be  interpretable  at 
lower  deviation-levels,  requiring  less  search. 

Figure  1  shows  adaptation’s  bootstrapping  effect 
graphically. Assume an implementation of the model with
a  limit  on  search  of  two  deviations  (some  limit  is  needed 
to  enforce  reasonable  system  response  and  preclude 
nonsensical  interpretations).  Also  assume  that  each  of  u2, 
u3,  and  u4  uses  “add”  as  its  verb.  Part  (b)  shows  that  u4 
is  interpretable  at  Level  1  but  that  u3  would  be  outside 
L(K)  even  with  error  recovery.  Once  u2  is  brought  into 
L(K),  however,  the  deviances  of  u3  and  u4  are  defined  in 
relation  to  the  adapted  kernel  K’:  u4  is  no  longer  deviant 
(having  been  brought  into  L(K’)  with  u2)  and  u3  is  now 
reachable  through  recovery.  Thus,  in  an  adaptive  interface 
all three utterances are accepted without rephrasing, and
only  u2  and  u3  require  search  beyond  Level  0.  In  a  system 
with  error  recovery  alone,  u2,  u3,  and  u4  would  require 
extended  search  (to  accept  u2  and  u4  and  reject  u3),  and 
u3  would  require  rephrasing.  In  a  static  interface  with 
restricted  sublanguage  L(K),  each  of  u2,  u3,  and  u4  would 
be  rejected,  requiring  additional  search  for  at  least  one 
rephrasing  per  utterance. 

3.  CHAMP,  an  Adaptive  Interface 
The  model  described  above  has  been  embedded  in  a 
working  interface  named  CHAMP  (CHAMeleonic  Parser) 
which is implemented in COMMON LISP on an IBM RT. In
addition  to  learning  idiosyncratic  grammars,  CHAMP’s 
task  is  to  aid  a  user  in  maintaining  a  schedule  of  events 
and  assist  in  arranging  airline  reservations  for  events  that 
require  travel.  The  number  of  general  actions  and  object 
types  in  the  system  is  fairly  small  (about  fifty),  although 
the  number  of  specific  objects,  such  as  particular  names 
and  locations,  is  unconstrained.  A  full  description  of 
CHAMP  can  be  found  in  [6]. 

The  system  uses  a  bottom-up,  semantically  and 
pragmatically  constrained,  least-deviant-first  parsing 
algorithm.  CHAMP  performs  deviation  detection  and 
recovery  in  terms  of  the  four  general  recovery  classes 
(insertion,  deletion,  substitution,  and  transposition), 
allowing  at  most  two  deviations  in  an  utterance. 
Deviations  are  detected  with  respect  to  syntactic  forms  and 
lexemes  in  the  current  grammar.  A  form  is  a  declarative 
structure  specifying  the  requirements  for  assigning  a 
segment  of  the  utterance  to  a  grammatical  category. 

To  understand  CHAMP’s  parsing  algorithm,  let  us 
again  consider  the  sentence  “Arrange  a  meeting  5  pm.” 
Simplified  versions  of  two  forms  used  in  understanding 
the  sentence  can  be  seen  below: 

ACT1 isa addform
  strategy:
    701 m-hourform
    702 m-intervalform
    703 m-dateform
    704 addword
    708 m-i-ggform
  required: 704 708
  unordered: 701 702 703
  exclusive: 701 702
The first form, ACT1, recognizes references to adding an
entry to the calendar. ACT1 requires that the input contain
at least two contiguous constituents: a verb in the class
addword and an indefinitely marked group-gathering
(m-i-ggform). The form allows the verb to be preceded by
prepositional phrases describing a time slot (by start hour
or interval but not both) and a date, in either order. HR0 is
the kernel form that recognizes the marked hour
(m-hourform) called for in step 701 of ACT1. HR0
requires as contiguous constituents a preposition
(hourmarker) and an unmarked hour.
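
Because a form is a declarative structure, it can be pictured as plain data plus a generic checker. The sketch below is only an illustration under assumptions: the `Form` dataclass, its field names, and the `violations` helper are hypothetical, and the step numbers are taken from the simplified forms ACT1 and HR0 as described in the text.

```python
from dataclasses import dataclass


@dataclass
class Form:
    name: str
    isa: str                             # grammatical category the form yields
    strategy: dict                       # step number -> class sought at that step
    required: frozenset = frozenset()    # steps that must be filled
    unordered: frozenset = frozenset()   # steps that may appear in any order
    exclusive: frozenset = frozenset()   # at most one of these may be filled


ACT1 = Form(
    name="ACT1",
    isa="addform",
    strategy={701: "m-hourform", 702: "m-intervalform", 703: "m-dateform",
              704: "addword", 708: "m-i-ggform"},
    required=frozenset({704, 708}),
    unordered=frozenset({701, 702, 703}),
    exclusive=frozenset({701, 702}),
)

HR0 = Form(
    name="HR0",
    isa="m-hourform",
    strategy={111: "hourmarker", 112: "hourform"},
    required=frozenset({111, 112}),
)


def violations(form, bound_steps):
    """List the ways a set of filled steps departs from the form."""
    bound = set(bound_steps)
    problems = []
    missing = sorted(form.required - bound)
    if missing:
        problems.append(("missing-required", missing))
    clash = sorted(form.exclusive & bound)
    if len(clash) > 1:
        problems.append(("exclusive-clash", clash))
    return problems


# "5 pm" with no preposition fills step 112 of HR0 but not step 111.
print(violations(HR0, {112}))
```

A missing required step reported this way is the kind of condition that a deletion recovery would compensate for; the checker itself performs no recovery.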

Figure  2  shows  a  portion  of  the  least-deviant-first  parse 
for  “Arrange  a  meeting  5  pm.”  For  clarity,  we  include 
only  those  constituents  that  are  part  of  the  final  parse  tree. 
The  bottom  of  the  figure  shows  the  first  step  in  parsing: 
segmenting  the  tokens  in  the  input  according  to  the  class 
information  attached  to  the  words  and  phrases  in  the 
lexicon.  Thus,  “pm”  is  mapped  to  nightword  while 
“arrange” is mapped to the special class unknown. A
word or phrase may belong to more than one class; a
unique  leaf  node  is  created  for  each  meaning. 
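
The segmentation step can be sketched as a lexicon lookup that emits one leaf node per meaning. This is a simplified illustration, assuming single-word lexicon entries only; the function name `segment`, the class name `number`, and the two-class entry for “may” are hypothetical and serve only to show a word with more than one meaning.

```python
# Hypothetical lexicon: each word maps to the list of classes it belongs to.
LEXICON = {
    "a": ["indefmarker"],
    "meeting": ["meetingword"],
    "pm": ["nightword"],
    "may": ["monthword", "modalword"],   # invented example of an ambiguous word
}


def segment(utterance):
    """Return one (position, token, class) leaf per meaning of each token."""
    leaves = []
    tokens = utterance.lower().rstrip(".").split()
    for position, token in enumerate(tokens):
        if token.isdigit():
            classes = ["number"]
        else:
            # Words absent from the lexicon fall into the special class.
            classes = LEXICON.get(token, ["unknown"])
        for word_class in classes:
            leaves.append((position, token, word_class))
    return leaves


for leaf in segment("Arrange a meeting 5 pm."):
    print(leaf)
```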

After  segmentation,  the  bottom-up  parsing  algorithm  is 
run  at  Level  0.  In  our  example,  most  of  the  work  done  at 
this  level  consists  of  assigning  leaf  nodes  to  strategy  steps 
that call for a member of the leaf's class. The only
constituent constructed at this time is an unmarked hour; it
is created by binding the number “5” to step 4 and the
nightword “pm” to step 9 in HR1.

Since no complete parse could be constructed at Level
0, CHAMP continues the search by allowing a single
deviation along any path in the tree. In our example,
tolerating a single ungrammaticality causes a series of
events. First, a marked hour is constructed via HR0 by
binding the previously built unmarked hour to step 112
and allowing a deletion deviation of the required, missing
step 111. A deletion annotation records the recovery.
Next, the marked hour is attached to the meetingword to
create a group-gathering via GG1. The group-gathering is
then attached to the indefmarker to create an m-i-ggform
via IGG1. This constituent may serve as the direct object
of an add action (step 708 in ACT1). For the parse to
succeed with only one deviation, however, ACT1’s other
required step (704) must be satisfied with a known
member of the class addword.

Since  no  known  addword  occurs  in  the  input,  a 
complete  parse  tree  cannot  be  constructed  and  CHAMP 
must  continue  at  Level  2.  Since  two  deviations  are  now 
permitted,  the  system  may  combine  the  indefinitely 
marked  group-gathering  with  the  unknown  lexeme 
“arrange”  to  form  a  complete  add  action.  A  substitution 
annotation  records  the  recovery. 

The  meaning  structure  produced  by  the  parser  is  called 
an  annotated  parse  tree,  or  APT.  As  seen  in  Figure  2,  an 
APT  contains  each  of  the  following  types  of  information 
(the  adaptation  context):  the  particular  grammatical  forms 


HR0 isa m-hourform
  111 hourmarker
  112 hourform
  required: 111 112

Figure 2: The annotated parse tree (APT) constructed for “Arrange a meeting 5 pm.”


used  to  recognize  the  constituents,  the  order  in  which  the 
constituents appeared in the sentence, and the recovery
actions  used  to  modify  one  or  more  of  the  forms.  Every 
annotated  parse  tree  must  meet  the  pragmatic  constraints 
represented  by  the  contents  of  the  calendar  and  airline 
databases  (in  our  example,  the  five  p.m.  slot  for  the 
current  day  must  be  empty).  In  addition,  an  APT  for  a 
deviant  utterance  must  have  its  projected  effect  on  the 
calendar  confirmed  by  the  user  (in  our  example,  the  user 
would  be  shown  the  new  calendar  entry  and  asked  if  it 
should  be  added  to  the  schedule).  Once  the  meaning  has 
been  established,  deviation  annotations  must  be  converted 
into  new  grammatical  constituents  through  adaptation.* 

4.  Adaptation  and  Generalization  in  CHAMP 
The  purpose  of  adaptation  is  to  bring  a  deviant  form 
into  the  grammar  by  deriving  new  grammatical 
components  from  the  adaptation  context  that  will  parse  the 
deviation  directly  in  future  utterances.  The  particular  set  of 
new  components  that  are  added  to  the  grammar  depends 
upon  which  deviations  are  present  in  the  utterance.  As 
shown  in  Figures  3  and  4,  CHAMP  learns  new  lexical 
definitions  in  response  to  substitution  and  insertion 
deviations,  and  new  syntactic  forms  in  cases  of  deletion 
and  transposition. 


*CHAMP produces all parses at a deviation-level. If different APTs
correspond to different effects, then identifying the intended effect may
also disambiguate the parse. If a number of APTs all correspond to the
correct effect (because of true structural ambiguity in the grammar), then
all are passed on to adaptation, where new grammatical constituents are
created as competitors. A competition is resolved by a future utterance
that requires some constituents but not others.


The  figures  reveal  that  CHAMP  learns  discriminations 
based  on  the  presence,  absence,  or  position  of  categories 
in  its  kernel  grammar.  Since  none  of  the  adaptations 
introduces  new  constituents  into  a  derived  form,  however, 
some  portions  of  the  user’s  natural  language  may  be  out  of 
reach.  Consider  the  following  sentences  taken  from  one 
user’s  first  session  with  CHAMP: 

s20: change June 11 ny to pgh from flight 82 to flight 265
s21: change from flight 82 to flight 265 on June 11 ny to pgh
s22: change flight 82 to flight 265 on June 11

This user’s first two attempts to perform the task require
learning that is beyond the scope of CHAMP’s adaptation
mechanism. The problematic segment is “from flight 82”,
which the parser tries to explain as a source in the change
action’s source-target pair because of the marker “from.”
No kernel form permits a flight object in the source (only
flight attributes such as location), and the system provides
no way to introduce the possibility into the language.
Thus, both s20 and s21 are rejected. S22, which does not
contain the misleading marker, is parsed by a kernel form.
Note that the limitation here is in CHAMP; the model
places no restrictions on inserted or substituted
constituents.

In  adapting  to  an  identified  deviation,  we  want  to 
construct a set of new grammatical components with two
properties.  First,  they  must  be  accessible  during  future 
parses  to  understand  this  sentence  and  others  like  it 
directly.  Second,  the  new  components  must  not  add 
unduly  to  the  cost  of  understanding  sentences  in  which 
they  ultimately  play  no  role.  In  short,  the  main  issue  in 
adaptive  parsing,  as  in  most  kinds  of  learning,  is  the  issue 
of  generalization:  capturing  the  correct  conditions  on 
usage  of  the  deviation. 
Substitution

•  Adaptation: a new lexical definition is added to the grammar for the unknown word or phrase as a member
of the word class that required the substitution. Example: “Arrange a meeting at 5 pm” where “arrange” is
unknown results in the addition of “arrange” to the lexical class addword.

•  Generalization: occurs through class references in the grammar. Everywhere the word class was previously
sought, the new lexeme is recognized.

•  Learning: permits substitutions of words and phrases only. Extends the set of indices for discriminations
already present in the grammar through the word class. Loses potential discriminations based upon the
tokens themselves, or the co-occurrence of the tokens or class with other classes or deviations present in the
adaptation context.

•  Search: reduces the deviation-level required to understand future occurrences of the lexeme. If the lexeme
was already defined in the lexicon, the new definition increases the amount of lexical ambiguity in the
system. This may create additional search paths for utterances that contain the lexeme, but has no effect on
the search space for utterances that do not contain it.

Insertion

•  Adaptation: a new lexical definition is added to the grammar for the unknown word or phrase as a member
of the special class insertword. Example: “Please cancel flight #451” adds “please” to insertword.

•  Generalization: occurs throughout the grammar. The new lexeme is allowed to occur without deviation
anywhere in subsequent utterances.

•  Learning: assumes the lexeme carries no meaning. Permits insertion of words and phrases only. Loses
potential discriminations based upon the co-occurrence of the tokens with other classes or deviations present
in the adaptation context.

•  Search: same as for substitution.

Figure 3: Summary of adaptation to substitutions and insertions.
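
The two lexical adaptations summarized in Figure 3 amount to adding a class membership to the lexicon. The sketch below illustrates that idea under assumptions: the function `adapt_lexical`, the tuple shape of a recovery annotation, and the dictionary representation of the lexicon are hypothetical. The return value flags the case in which the phrase already had a definition, so that the new one introduces lexical ambiguity.

```python
def adapt_lexical(lexicon, annotation):
    """Add the lexical definition implied by one recovery annotation.

    lexicon maps a word or phrase to the set of word classes it belongs to.
    annotation is ("substitution", phrase, required_class) or
    ("insertion", phrase).
    Returns True when the phrase already had a different definition.
    """
    kind, phrase = annotation[0], annotation[1]
    if kind == "substitution":
        new_class = annotation[2]      # the class that required the substitution
    elif kind == "insertion":
        new_class = "insertword"       # special class: may occur anywhere
    else:
        raise ValueError("deletions and transpositions derive forms, not lexemes")
    ambiguous = phrase in lexicon and new_class not in lexicon[phrase]
    lexicon.setdefault(phrase, set()).add(new_class)
    return ambiguous


lexicon = {"schedule": {"addword"}}
adapt_lexical(lexicon, ("substitution", "arrange", "addword"))
adapt_lexical(lexicon, ("insertion", "please"))
print(lexicon)
```

Because every form that seeks addword looks the class up in the lexicon, the new lexeme is recognized everywhere that class was previously sought, with no change to the forms themselves.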


With respect to accessibility, Figures 3 and 4 show that
a derived component that recognizes a new way of
referring to a class of constituents is “inherited upward”
through the grammar. In other words, the new component
can be used by all forms that call for constituents of that
class. The main advantage to using class inheritance for
generalization is that it provides a simple, uniform
mechanism. The disadvantage is undergeneralization;
learning across established boundaries in the grammar
requires discrete episodes. Consider, as an example, that
the kernel distinguishes group-gatherings marked by
indefinite articles from those marked by definite articles
by assigning the former to the class m-i-ggforms (which
can be part of an add action) and the latter to m-d-ggforms
(which cannot). The first time a user drops an
article from her utterance, she does so in one context or
the other; CHAMP learns a new instance of one of the
classes, but not of both.

With respect to cost, two factors are relevant:
overgeneralization and ambiguity. Although inheritance
upward can overgeneralize the correct conditions on
usage, bottom-up parsing algorithms tend to compensate
for this fairly well. Ambiguity is more problematic. As
Figure 3 reveals, substitution and insertion adaptations
may introduce lexical ambiguity if the new definitions act
as alternatives to entries already in the lexicon.† The cost
of the ambiguity during search depends initially upon the
number of additional forms indexed by the new definition.
The degree to which the ambiguity propagates through the
search space depends upon what other constraining
information is provided by the utterance.

Deletion adaptations may introduce structural
ambiguity into the grammar. Let us reconsider the case of
m-i-ggforms and m-d-ggforms. In the kernel there is only
one member of each class, DGG1 and IGG1. Both forms
require a marker and a subconstituent from the class
ggforms. The critical difference between the forms is the
marker class; as long as each form requires a different
marker, both forms cannot succeed at the same deviation-level.
If the user drops both definite and indefinite articles,
however, the system will derive forms omitting each
marker; both DGG1’ and IGG1’ will require only one
subconstituent, an unmarked group-gathering. Viewed
differently, any unmarked group-gathering object will
succeed at Level 0 in the adapted grammar as a member of
both classes. Again, the cost of the ambiguity during
search depends initially upon the number of additional
forms indexed by the incorrect class assignment. The


†In CHAMP, it is impossible to introduce a new lexical definition for a
word or phrase that is already defined without employing extra-linguistic
conventions. The lexical extension problem is discussed in [6].
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degree  to  which  the  cost  propagates  depends  upon  the 
other  constituents  in  the  sentence. 
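The way a deletion adaptation can erase a critical difference can be sketched in a few lines of Python. This is a minimal illustration rather than CHAMP's implementation: the `Form` record, the marker class names `defmarker` and `indefmarker`, and the exact-match test standing in for a Level 0 parse are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    name: str        # e.g. "DGG1"
    cls: str         # grammatical class the form belongs to
    required: tuple  # constituent classes the form requires


def adapt_to_deletion(parent: Form, deleted: str) -> Form:
    """Derive a new form that inherits everything except the deleted step."""
    remaining = tuple(step for step in parent.required if step != deleted)
    return Form(parent.name + "'", parent.cls, remaining)


def classes_accepting(grammar, constituents):
    """Classes having a form whose requirements are met exactly (no deviation)."""
    found = {f.cls for f in grammar if set(f.required) == set(constituents)}
    return sorted(found)


# Kernel: the marker class is the only difference between the two forms.
# "defmarker" and "indefmarker" are hypothetical names, not CHAMP's.
kernel = [
    Form("DGG1", "m-d-ggforms", ("defmarker", "ggforms")),
    Form("IGG1", "m-i-ggforms", ("indefmarker", "ggforms")),
]

# The user drops both kinds of article, so both forms are adapted.
adapted = kernel + [
    adapt_to_deletion(kernel[0], "defmarker"),
    adapt_to_deletion(kernel[1], "indefmarker"),
]

print(classes_accepting(kernel, ["ggforms"]))
print(classes_accepting(adapted, ["ggforms"]))
```

Under the kernel, an unmarked group-gathering satisfies neither class without a deviation; under the adapted grammar it satisfies both, which is the structural ambiguity described above.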

Transpositions  may  introduce  ambiguity  if  the  critical 
difference  between  the  parent  and  the  derived  form  is  in 
an  ordering  relation  on  constituents  not  present  in  the 
utterance. Consider the example in Figure 4. ACT1 and
ACT1’ differ by the ordering relation imposed on
addword and m-dateform. Thus, if no date appears in an
utterance recognizable by ACT1, the utterance will also be
recognizable by ACT1’. Because of an optimization in the
implementation,  this  sort  of  adaptation  only  introduces 
ambiguity  at  non-zero  deviation-levels,  although  the 
ambiguity  always  propagates  through  the  search  space. 

It  is  important  to  note  that  even  though  each  type  of 
adaptation  may  increase  the  amount  of  ambiguity  in  the 
current  grammar,  the  future  search  space  for  utterances 
containing the deviation is always significantly smaller
than  it  would  have  been  had  we  not  adapted.  The  reason  is 
simple:  any  ambiguous  paths  at  Level  0  were  also  part  of 
the  much  larger  search  space  at  the  higher  deviation-level. 


Of  course,  the  cost  of  parsing  some  utterances  that  were  in 
the  grammar  prior  to  the  adaptation  will  have  increased, 
but  three  factors  make  the  trade-off  worthwhile.  First, 
since  adaptations  reflect  preferred  forms  of  expression, 
utterances  relying  on  the  adaptations  are  likely  to 
reappear; the more often an adaptation is reused the more
favorable  the  trade-off  becomes.  Second,  adaptation 
brings  more  of  the  user’s  language  into  the  grammar, 
resulting  in  fewer  rejected  parses  over  time.  Since 
rejecting  a  sentence  requires  a  search  through  Level  2  plus 
the  search  associated  with  understanding  any  subsequent 
rephrasings,  any  action  that  prevents  rejections  must 
reduce  search  overall.  Finally,  since  the  frequent  user’s 
language  is  self-bounded,  whatever  increase  in  ambiguity 
results  from  adaptation  must  be  bounded  as  well. 

5.  Analysis  of  the  Utility  of  Adaptation 

The  purpose  of  this  section  is  to  provide  a  sense  of  the 
overall  utility  of  adaptation  and  generalization  by 
answering two questions for CHAMP’s performance on
users’ spontaneously generated input. First, what is the


Deletion

• Adaptation: a new form is added to the grammar. It is derived from the form in which the deviation
occurred and inherits all the information present in its parent that does not relate to the deleted step.
Example: “Arrange a meeting 5 pm” results in the creation of HRO’ derived from HRO. HRO’ is also a
member of the class m-hourforms but its strategy and required lists do not contain hourmarker.

• Generalization: occurs through class references in the grammar. The derived form may be used anywhere
the parent form is used to recognize a constituent. Deletions and transpositions within a single constituent
are preserved as co-occurring.

• Learning: permits future discriminations based on the presence or absence of a class. Permits
discriminations based upon some co-occurrences of deviations.

• Search: reduces the deviation-level required to understand future utterances containing the deletion. May
introduce structural ambiguity into the grammar if the deleted step represents the sole difference between
two grammatical categories.


Transposition

• Adaptation: a new form is added to the grammar that explicitly captures the new ordering of steps. The new
form inherits all the information present in the parent that does not relate to the ordering of the transposed
step. The ordering relations on the transposed step depend on the steps surrounding it after transposition.
Example: “Schedule on June 4 a meeting with Alice” results in the creation of ACT1’ from ACT1. ACT1’
has the strategy (m-hourform m-intervalform addword m-dateform m-i-ggform) where only the hour and
time interval are unordered.

• Generalization: occurs through class references in the grammar. The derived form may be used anywhere
the parent form is used to recognize a constituent. Deletions and transpositions within a single constituent
are preserved as co-occurring.

• Learning: permits discriminations based on ordering. Permits discriminations based upon some
co-occurrences of deviations.

• Search: reduces the deviation-level required to understand future utterances that reflect the new ordering. If
the transposed step is not required, the adaptation may introduce ambiguity into the search space at non-zero
deviation-levels whenever the transposed step is not needed to understand the utterance.


Figure 4: Summary of adaptation to deletions and transpositions.


                User 1    User 2    User 3    User 4    User 9    User 10

Kernel          18/127    34/144    9/138     25/130    20/212    19/173
                (14%)     (24%)     (7%)      (19%)     (9%)      (11%)

Final grammar   115/127   127/144   112/138   118/130   192/212   148/173
                (91%)     (88%)     (81%)     (91%)     (91%)     (86%)

Figure  5:  The  effectiveness  of  learning  as  measured  by  the  differences  in  the  number  of 
utterances  understood  by  the  kernel  and  by  the  user’s  final  grammar. 


effectiveness  of  learning?  That  is,  how  much  actual 
improvement  in  understanding  the  user's  language  comes 
from  adaptation?  Second,  what  is  the  cost  of  learning? 
That  is,  what  effect  do  adaptation  and  generalization  have 
on  the  level  of  ambiguity  in  the  grammar? 

The  data  used  during  evaluation  came  from  two 
sources. In early experiments, Users 1, 2, 3, 4, 5, and 7
interacted with a simulated adaptive interface based on the
model described in Section 2 (Users 6 and 8 were in a
non-adaptive control condition). Of the 657 utterances
collected from these users, 85 were chosen from Users 1
and  2  to  guide  in  the  design  of  CHAMP.  Users  9  and  10 
participated  in  subsequent,  on-line  experiments  with 
CHAMP,  producing  an  additional  385  test  utterances.  In 
both  experiments  the  user’s  task  was  to  look  at  a  pictorial 
representation  of  a  change  to  the  calendar  and  use  the 
system  to  effect  that  change  in  the  on-line  schedule.  Since 
Users  5  and  7  did  not  complete  all  nine  experimental 
sessions,  we  consider  here  only  results  for  the  six  users 
who  did.^ 

What  is  the  effectiveness  of  learning  in  CHAMP?  To 
answer  this  question  we  contrast  for  each  user  the  number 
of  her  sentences  accepted  by  the  original  kernel  and  the 
number accepted by her final grammar; any increase is
due to learning. Figure 5 shows that the increase is
significant  for  each  user  and  that  CHAMP’s  performance 
on  data  collected  during  the  simulation  experiments  did 
not  differ  significantly  from  the  system’s  performance  on 
data  from  the  on-line  experiments.  On  the  average,  the 
kernel  accepts  only  16%  of  a  user’s  utterances  while  her 
own  derived  grammar  accepts  88%. 
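The per-user percentages in Figure 5 follow directly from the counts. The short Python check below is only a convenience sketch: the dictionary layout and the helper name `percent` are assumptions for illustration, while the counts themselves are transcribed from the figure.

```python
# Counts transcribed from Figure 5:
# (accepted by kernel, accepted by final grammar, total utterances).
figure5 = {
    "User 1": (18, 115, 127),
    "User 2": (34, 127, 144),
    "User 3": (9, 112, 138),
    "User 4": (25, 118, 130),
    "User 9": (20, 192, 212),
    "User 10": (19, 148, 173),
}


def percent(part, whole):
    """Acceptance rate as a whole-number percentage."""
    return round(100 * part / whole)


rates = {
    user: (percent(kernel, total), percent(final, total))
    for user, (kernel, final, total) in figure5.items()
}

for user, (kernel_pct, final_pct) in rates.items():
    print(f"{user}: kernel {kernel_pct}%, final grammar {final_pct}%")
```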


^Because of differences between the model and the implementation,
some relatively minor preprocessing of utterances from the simulation
experiment was done prior to evaluation. In addition, extra-grammatical
markers were introduced to compensate for the problem of lexical
extension in both groups (see previous footnote). For a full description of
the experiments, a list of the test utterances (with modifications
indicated), and a number of other evaluation results, see [6].


To  carry  the  analysis  further,  we  measure  the  utility  of 
each  learning  episode  by  computing  the  average  number 
of  sentences  brought  into  the  grammar  each  time  a  deviant 
utterance  is  interpreted.  These  values  (3.3,  2.7,  2.3,  3.7, 
5.7, and 5.6) represent a kind of “bootstrapping constant”
that  reflects  the  way  in  which  CHAMP’s  particular 
implementation  of  adaptation  captures  within-user 
consistency.  An  alternative  implementation  would 
probably  produce  very  different  values.  Consider,  for 
example,  an  interface  using  a  more  conservative  approach 
to  substitutions  by  creating  a  new  class  for  the  substituted 
word  and  a  new  form  requiring  that  class  (unlike  CHAMP, 
such  a  system  permits  discriminations  based  on  the  tokens 
themselves).  The  tendency  of  this  approach  to 
undergeneralize is likely to appear as a decrease in the
utility  of  each  learning  episode;  the  user’s  final  grammar 
might  result  in  fewer  acceptances,  or  in  the  same  number 
of  acceptances  but  at  the  cost  of  requiring  more  instances 
of  adaptation.  Thus,  the  values  themselves  are  not  as 
important  as  the  fact  that  our  ability  to  compute  them 
provides  a  metric  for  comparing  design  choices. 

The second question we posed was: what is the cost of
adaptation? As CHAMP brings more of the user’s
language  into  the  grammar  it  increases  the  likelihood  that 
it  will  understand  her  future  utterances.  But  is  the  increase 
in  understanding,  as  measured  by  acceptances,  negated  by 
a larger increase in the cost of understanding, as measured
by  search?  We  know  that  the  user’s  language  is  self- 
bounded,  but  it  may  still  be  quite  ambiguous.  The 
question,  then,  is  not  whether  we  can  prevent  the  rise  in 
search  stemming  from  inherent  ambiguity  in  the  user’s 
idiolect,  but  whether  the  system  as  a  whole  suffers 
disproportionately  as  ambiguity  increases. 

We  measure  the  rise  in  ambiguity  in  an  adapted 
grammar  in  two  ways.  First,  holding  the  test  sentences 
constant,  we  compare  the  average  number  of  parse  states 
considered  by  successive  grammars  during  search  at  Level 
0.  As  the  amount  of  ambiguity  in  the  grammar  increases 
through the adaptations of each session, so will the


        Number of utterances    Average number of       Average number of
                                states to accept        parse trees/utterance

        G(0)      G(9)          G(0)      G(9)          G(0)      G(9)

U1      18                      19.3      37.6          1.1       1.5
U2      34        123           21.9      59.8          1.1       1.6
U3      9         112           18.0      43.3          1.0       2.0
U4      25        118           21.0      40.7          1.1       1.5
U9      20        192           13.0      80.9          1.1       2.6
U10     19        148           40.4      45.8          1.1       1.7

Figure  6:  Change  in  the  cost  of  parsing  non-deviant  utterances  as  a  function  of  grammar  growth. 


average  amount  of  search  required  to  accept  an  utterance 
at  Level  0.  Second,  we  examine  the  average  number  of 
APTs  produced  for  each  sentence  accepted  at  Level  0. 
This  value  reflects  a  rise  in  ambiguity  that  is  partly 
independent  of  the  increase  in  search  because  of  the  way 
APTs  share  substructure.  A  rise  in  search  need  not  give 
rise  to  additional  parse  trees.  Similarly,  additional  APTs 
may  indicate  only  a  modest  increase  in  search.  From  the 
user’s  point  of  view  increased  search  corresponds  to 
decreased  response  time  while  a  rise  in  the  number  of 
parses  corresponds  to  a  rise  in  the  number  of  interactions 
required  for  resolution  of  her  intended  meaning. 

Figure  6  summarizes  the  relevant  measures  for  each 
user  by  showing  the  values  for  the  kernel  grammar  (G(0)) 
and  for  her  adapted  grammar  at  the  end  of  session  nine 
(G(9)).  For  four  of  the  six  users,  the  increase  in  the 
number  of  parse  states  examined  is  proportionally  far  less 
than  the  increase  in  the  number  of  sentences  understood. 
By  the  end  of  the  ninth  session,  the  search  for  User  1  has 
expanded  by  a  factor  of  two  but  the  number  of  her 
sentences  that  are  now  non-deviant  has  expanded  by  a 
factor  of  six.  User  3’s  trade-off  is  even  more  favorable: 
twelve  times  as  many  sentences  are  accepted  by  the  final 
grammar  as  by  the  kernel,  at  a  cost  of  only  two  and  a  half 
times  the  search.  User  4  gains  almost  five  times  as  many 
sentences at slightly less than twice the search. The trade-off
is most favorable for User 10 who gains almost eight
times as many accepted sentences with virtually no
increase in search. User 2 has the most balanced case: a
factor of 3.5 increase in accepted utterances and a factor of
three  increase  in  search.  The  largest  increase  in  search 
occurs  for  User  9  (about  a  factor  of  six)  but  the  ten-fold 
increase  in  the  number  of  her  sentences  accepted  still 
places  her  within  the  general  trade-off  ratios  seen  among 
the  others. 
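The trade-off ratios quoted above can be recomputed from the Figure 6 values. The following Python sketch is offered only as a check on the arithmetic. The data layout and the helper name `growth` are illustrative, and because User 1's G(9) utterance count is not legible in Figure 6, the value 115 is borrowed from Figure 5 as an assumption.

```python
# Per-user values from Figure 6:
# ((utterances, average states) under G(0), (utterances, average states) under G(9)).
# ASSUMPTION: User 1's G(9) utterance count (115) is taken from Figure 5.
figure6 = {
    "U1": ((18, 19.3), (115, 37.6)),
    "U2": ((34, 21.9), (123, 59.8)),
    "U3": ((9, 18.0), (112, 43.3)),
    "U4": ((25, 21.0), (118, 40.7)),
    "U9": ((20, 13.0), (192, 80.9)),
    "U10": ((19, 40.4), (148, 45.8)),
}


def growth(before, after):
    """Multiplicative change from the kernel grammar to the final grammar."""
    return after / before


tradeoff = {}
for user, ((n0, s0), (n9, s9)) in figure6.items():
    tradeoff[user] = (growth(n0, n9), growth(s0, s9))

for user, (accepted, search) in tradeoff.items():
    print(f"{user}: accepted x{accepted:.1f}, search x{search:.1f}")
```

For every user in the table, the growth in accepted utterances exceeds the growth in search, which is the pattern the discussion relies on.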

If  increase  in  search  corresponds  to  increase  in  response 
time,  what  do  these  values  tell  us?  In  short,  response  times 
will  get  slower  over  all  but  will  not  grow  exponentially  as 


a  function  of  the  increase  in  the  language  accepted.  Since 
the  grammar  itself  is  bounded  by  the  user’s  natural 
behavior,  the  increase  in  response  time  is  bounded  as  well. 
Even with the kinds of increases in search seen for these
users,  response  times  were  usually  under  ten  seconds. 

The  near-monotonic  increase  in  the  size  of  the  search 
goes  hand-in-hand  with  a  near-monotonic  increase  in  the 
average  number  of  APTs  produced  for  each  accepted 
utterance.  Where  is  the  ambiguity  coming  from?  The  users 
do  introduce  some  lexical  ambiguity  into  their 
idiosyncratic  grammars,  but  in  CHAMP’s  parser  lexical 
ambiguity contributes primarily to small, local increases in
the  size  of  the  search  space.  Examination  of  the  parse  trees 
produced  with  successive  grammars  for  the  same  sentence 
showed  that  most  of  the  increase  in  ambiguity  comes  from 
adaptation  to  deletion  deviations.  As  the  user’s  language 
becomes  increasingly  terse,  the  system  builds  forms  in 
which  the  content  words  that  correspond  to  critical 
differences  between  strategies  are  deleted.  As  a  result, 
there  is  an  increase  in  the  number  of  constituents  that  are 
satisfied  by  each  segment;  often  the  increase  propagates  to 
the  root  nodes  themselves.  User  9  is  a  case  in  point:  her 
grammar  included  derived  forms  omitting  almost  every 
content  word  in  the  kernel.  Specifically,  by  the  end  of 
session  three  she  had  dropped  most  markers,  three  of  the 
four  verbs,  and  two  of  the  four  group-gathering  head 
nouns.  As  a  result,  almost  every  sentence  she  typed  in  the 
last  seven  sessions  created  at  least  two  APTs.  On  the  last 
day  the  simple  sentence,  “Dinner  June  24  with  Allen,” 
created  twelve  parse  trees  at  Level  0. 

6.  Summary  and  Future  Work 

In  total,  CHAMP  has  been  tested  on  1042  utterances 
most  of  which  represent  unmodified  spontaneous  input  by 
frequent  users  whose  job  includes  calendar  scheduling  as 
part  of  its  duties.  We  found  no  qualitative  differences 
between CHAMP’s performance for utterances gathered
by  simulation  of  the  model  and  the  system’s  performance 
for utterances produced during on-line interactions. We
did find that acceptance of user utterances relied
significantly on adaptation. On average, the kernel
grammar accepted 7% to 24% of each user’s utterances,
while  her  final  grammar  accepted  81%  to  91%.  We  also 
found  two  areas  of  system  performance  that  could  be 
improved. 

First,  although  CHAMP  was  able  to  understand  about 
84%  of  each  user’s  utterances,  this  rate  is  about  8%  lower 
than  that  predicted  by  the  model  for  the  test  sets.  The 
difference  is  caused  primarily  by  deviations  that  require 
the  system  to  introduce  a  new  type  of  meaningful 
constituent into a context that does not expect it. An
unrestricted implementation of insertion and substitution
recoveries would solve the problem, although such an
extension would have non-trivial implications for
learning. A single insertion, for example, can usually be
attached at many locations in the parse tree, any of which
may represent the correct or most useful generalization of
the utterance’s structure. It is unclear whether the existing
competition mechanism is adequate to eventually
determine the most useful form. If not, what heuristics
might we use to decide?

Second, although CHAMP’s adaptations resulted in
only  a  modest  rise  in  ambiguity  in  each  adapted  grammar, 
the  increase  is  only  partly  explained  by  inherent  ambiguity 
in  the  user’s  idiolect.  A  significant  portion  of  the  added 
ambiguity  comes  from  the  interaction  between  critical 
differences  and  adaptation  to  deletion  deviations.  Kernel 
forms  were  designed  to  contain  critical  differences  with 
respect  to  each  other  because  such  differences  help  to 
constrain  search.  Yet,  whenever  adaptation  eliminates  a 
difference, ambiguity may result. Thus, a possible method
for overcoming this source of performance degradation is
to track critical differences explicitly and create a new
grammatical category whenever a discrimination that
separates  two  existing  categories  disappears.  Such  an 
extension  raises  interesting  questions:  when  do  you 
replace  existing  categories,  and  when  do  you  simply 
augment  the  grammar  with  a  new  category?  In  the  case  of 
augmentation,  which  previously  existing  forms  should 
refer  to  the  new  category  and  which  to  the  old? 

Regardless  of  the  improvements  offered  above, 
evaluation  of  the  system  clearly  shows  that  adaptation  is  a 
robust  learning  method.  CHAMP  was  able  to  learn  eight 
very  different  grammars,  corresponding  to  very  different 
linguistic  styles,  with  a  single  general  mechanism.  We 
believe  this  represents  a  significant  step  forward  in  the 


effort  to  provide  users  with  full  freedom  of  expression  in 
computer interactions.

Abstract

We present a machine learning program,
called ANT, which learns the grammar of a
second language. Input to the system is similar
to what is found in a typical introductory
foreign language text; that is, a mixture of
instructions describing grammar rules, and
examples illustrating these rules. We compare
ANT’s learning to two alternatives: learning
from only instructions, and learning from
only examples. We discuss why, from a functional
or processing standpoint, learning from
a mixed input is more effective than either
of the alternatives. We also present an empirical
comparison of our algorithm’s performance
on input containing both instructions
and examples vs. performance of the system
when given instructions only or examples
only. The results of the comparison support
our hypotheses as to the utility of mixed input.

1  Introduction 

This paper describes a program called ANT (Acquisition
using Native-language Transfer), which learns
the grammar of a second language. ANT successfully
learns approximately 85% of the grammar rules presented
in a typical first-year German textbook. Input
to the system is similar to what is found in a typical
introductory text. The system modifies its English
grammar rules accordingly, so that they correspond to
the grammar of German. ANT can then “understand”
German sentences.

Almost all foreign language texts follow the same format
when presenting a new grammatical construction.
When one looks into a typical text, one does not find
a list of instructions which explain the grammatical
constructions of the second language. Nor does one
find simply a list of example sentences which illustrate
the foreign language’s grammar. Texts almost
always contain instructions and examples integrated
together. The general format is the introduction of a
new rule with an instruction, followed by a set of examples
which illustrate the rule taught in the instruction.
Why does this seem to be the optimal format for such
a text, or at least the most common format?

In this paper, we will address this question from the
standpoint of a machine learning theory. We present
our theory of learning from instructions (which include
examples), which has been implemented in ANT.
We then discuss why, from a functional or processing
standpoint, learning from this sort of input is more effective
than either obvious alternative, learning from a
set of examples, or learning from instructions without
any examples. In building ANT, we have discovered
several reasons why, computationally, it is helpful for
our program to be given instructions along with examples
in order to learn new grammar rules.

To provide further evidence in support of our theory,
we present an empirical comparison of three types
of learning: ANT’s approach, which uses both examples
and instructions; a version of ANT which learns
from instructions only; and a version which learns from
examples only. In terms of both efficiency and correctness,
the original version’s performance is superior to
the performance of the two alternatives.

2  An  example  of  ANT’s  performance 

To explain how ANT learns, we begin by presenting an
example lesson. While there are differences in ANT’s
processing depending on what type of rule it is learning,
this example illustrates the main points of how
ANT works. The example lesson is the following.

In  German,  verbs  come  at  the  end  of  relative 
clauses. 

Examples: 

Der Erdferkel, der Ameisen oft frißt,
läuft langsam

(the aardvark who ants often eats
runs slowly)*

*ANT does not receive an English translation as part of
its input. The literal English translation is provided here
for the benefit of the reader.

Der Mann, der mir Bücher gibt,
wohnt in Paris.

(the man who me books gives
lives in Paris)

ANT’s  grammar  rules  are  written  in  a  unification- 
style  format  (Shieber,  1986),  as  discussed  in  (Lytinen 
and  Moon,  1988;  Moon  and  Lytinen,  1989).  However, 
for  the  purposes  of  this  paper,  we  can  assume  that 
they  are  context-free  rules. 

Before receiving this input, ANT assumes that its
English relative clause (RC) rules apply to German.*
Two of these rules are the following:

(1) RC → RP VG NP

(2) RC → RP VG NP NP

Some of ANT’s rules for VG’s (verb group) and
CONJ-V (conjugated verb) are given below:

(3) VG → CONJ-V ADV

(4) VG → CONJ-V

(5) VG → CONJ-V INF

(6) CONJ-V → AUX

(7) CONJ-V → MODAL-AUX

(8) CONJ-V → V

ANT  produces  the  following  representation  when  it 
reads  the  instruction: 

ORDER

CONSTITUENT : CONJ-V
OCCURS-IN : RC
POSITION : LAST

This  means  for  ANT  that  the  rule  to  be  learned  has 
to  do  with  ordering,  specifically  with  the  position  of 
the  conjugated  verb  in  the  relative  clause. 

Modifying ANT’s English relative clause rules to
conform to German is not a straightforward matter.
This is because the context-free rules for relative
clauses (RC) do not even mention verbs. The
constituent VG appears in them, but simply moving
the VG to the end of each RC rule would not be the
correct modification, since the verb is not always the
last constituent in a VG. Thus, ANT cannot simply
change its RC rules. Other relevant rules, such as the
rules for what makes up a VG, potentially must also
be changed.

This is where examples come into play in the learning
process. Without examples, ANT would have to
search through its grammar for possible appearances
of verbs within relative clauses. In the worst case, this
could mean searching the entire grammar, since a verb
could in theory appear inside of any constituent of a
RC, and RC’s could possibly contain every other kind

*This is analogous to a well-known phenomenon in human
second language learning, known as native language
transfer, in which learners typically selectively transfer patterns
from the native language to the second language
(Selinker, 1969).


of constituent. In addition, we could not in general
guarantee that the search would find the correct verb,
as there could be (and in fact there are) several occurrences
of VG and CONJ-V within relative clauses.^
However,  because  ANT  is  provided  with  examples  in 
addition  to  instructions,  ANT  lets  the  examples  guide 
it  to  the  rules  which  must  be  changed.  ANT  does 
this  by  parsing  the  examples.  During  the  parse,  ANT 
is  forced  to  use  the  rules  which  must  be  modified  for 
German.  Thus,  the  potentially  large  search  through 
the  grammar  is  avoided. 

Getting back to our example, because the rule ANT
is learning is an ordering rule, ANT relaxes the ordering
constraints in its relative clause rules when parsing
the examples, meaning that it will be able to parse a
sentence whose relative clause word ordering does not
conform to English grammar.^ Let us consider the first
example:

Der Erdferkel, der Ameisen oft frißt, läuft
langsam

(the aardvark who ants often eats runs
slowly)

ANT parses this example, arriving at the parse tree
in Figure 1. Once the example is parsed, it is clear
that the verb “frißt” (eats) is the verb in the relative
clause. Now ANT has identified the constituent
that must move: “frißt” is the CONJ-V within the VG
within the RC. This information tells ANT that the
category VG must be modified, at least when VG’s
appear in relative clauses. This leads ANT to modify
rule 1 from before to construct the following new
grammar rules:

(9) RC → RP NP VCOMP CONJ-V

(10) RC → RP NP CONJ-V

Notice that the category VG has been eliminated
from these rules. This is because the rules for VG must
stay the same, since verb groups occur in other rules
besides RC rules. In RC rules, VG is replaced by the
conjugated verb (CONJ-V) along with a new category,
called VCOMP, the vestiges of the VG category. The
rules for VCOMP are as follows:

(11) VCOMP → ADV

(12) VCOMP → INF

Two  RC  rules  (9  and  10)  are  generated  because 
VCOMP  is  optional. 
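The rewrite of rule 1 into rules 9 through 12 can be illustrated with a small Python sketch. It is a simplification under stated assumptions and not ANT's algorithm: ANT reads the new constituent order off the parse tree of the example, whereas this sketch hard-codes the placement of VCOMP directly before the verb, and the function names `split_verb_group` and `move_verb_last` are hypothetical.

```python
def split_verb_group(vg_rules, verb="CONJ-V"):
    """Return the VCOMP alternatives: each VG right-hand side minus the verb."""
    vcomp = []
    for rhs in vg_rules:
        rest = [sym for sym in rhs if sym != verb]
        if rest and rest not in vcomp:
            vcomp.append(rest)
    return vcomp


def move_verb_last(rc_rhs, group="VG", verb="CONJ-V", remainder="VCOMP"):
    """Rewrite one RC rule so the verb comes last.

    Two rules are produced because the remainder category is optional.
    ASSUMPTION: the remainder is placed directly before the verb.
    """
    others = [sym for sym in rc_rhs if sym != group]
    return [others + [remainder, verb], others + [verb]]


vg_rules = [["CONJ-V", "ADV"], ["CONJ-V"], ["CONJ-V", "INF"]]  # rules 3-5
rule_1 = ["RP", "VG", "NP"]                                    # rule 1

vcomp_rules = split_verb_group(vg_rules)
new_rc_rules = move_verb_last(rule_1)

for rhs in new_rc_rules:
    print("RC →", " ".join(rhs))
for rhs in vcomp_rules:
    print("VCOMP →", " ".join(rhs))
```

The VG rules themselves are left untouched, matching the observation that verb groups still occur unchanged outside relative clauses.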

Processing  of  the  second  example  helps  ANT  modify 
relative  clause  rule  2  in  a  similar  way,  creating  the 
following  final  set  of  rules. 

(13) RC → RP VCOMP NP CONJ-V

(14) RC → RP NP CONJ-V

^For example, relative clauses contain NP’s, which can
contain other relative clauses, which contain verbs.

^The  way  in  which  ANT’s  unification  grammar  rules 
are  written  is  essential  to  ANT’s  ability  to  parse  sentences 
which  do  not  conform  to  its  grammar,  and  is  beyond  the 
scope of this paper.

Figure 1: Parse of the aardvark example

(15) RC → RP VCOMP NP NP CONJ-V

(16) RC → RP NP NP CONJ-V

3  Why  ANT  needs  instructions  and 
examples 

We  have  seen  an  example  of  how  ANT  processes  input 
to  learn  a  new  grammar  rule.  The  input  consists  of 
a mixture of instructions and examples. Now let us
summarize  the  roles  that  the  two  parts  of  the  input 
play  in  the  learning  process. 

3.1  The  role  of  instructions 

When  ANT  learns  a  new  rule,  there  are  three  distinct 
roles played by the instruction portion of the input:

•  alter  expectations  of  input 

• focus attention on part of the example that illustrates
the new information

•  determine  how  general  the  new  rule  is 

The actual formation of new rules takes place during
the processing of examples. However, the way in
which examples are processed is greatly influenced by
the instruction. First, ANT’s ability to parse German
examples is facilitated by the instruction, because it
tells the system something about the differences to
be expected in the examples. In our relative clause
rule, the instruction told the system that its word order
constraints might be violated within example relative
clauses. This enabled ANT to parse these examples
even though they did not conform to its (English)
grammar.

The instruction also improves the formation of new
rules. From the information in the instruction, the system
can focus attention on the important part of the
example, and thus can more quickly find the change(s)
to be made to the original rules. Often the instruction
clearly indicates how general the new rules should
be. In our relative clause example, the instruction told
the system that any modifications to the category VG
should only affect relative clauses. Without this information,
ANT would have had no way of knowing
whether all verb groups are different in German than
they are in English, or perhaps some other subclass
of VG’s, such as only verb groups followed by a single
NP. Many such hypotheses would be consistent with
the examples that ANT received.

3.2  The  role  of  examples 

While instructions are critical to ANT’s performance,
the examples presented along with the instructions
also play a crucial role in the system’s learning process.
In particular, examples play the following three
roles:

•  identify  relevant  previous  knowledge 

•  form  new  rules 

•  fill  in  details  not  mentioned  in  instructions 

The  first  role  above  was  illustrated  in  the  example  in 
section  2.  Rather  than  search  its  grammar  to  identify 
relative  clause  rules  which  had  to  be  changed,  ANT 
relied  on  examples  to  point  the  way  to  previous  rules. 
The  examples  identified  the  location  of  the  verb  in  the 
relative  clause  and  thus  gave  the  system  the  knowledge 
it  needed  to  be  able  to  relate  RC  to  CONJ-V. 

New rules to be learned are often formed as part of
the  parse  of  the  example.  After  the  parse  is  complete, 
part  of  the  parse  tree  is  an  instantiation  of  the  new 
grammar  rule.  Thus,  ANT  can  extract  the  new  rule 
directly  from  the  parse  of  the  example.  However,  the 
instantiation  is  often  more  specific  than  the  new  rule 
ought  to  be.  The  information  from  the  instruction  is 
used  to  determine  how  general  the  new  rule  can  be 
made. 

The  third  role  is  not  illustrated  by  our  relative  clause 
example.  However,  often  it  is  the  case  that  instructions 
found  in  textbooks  do  not  completely  specify  the  rule 
to  be  learned.  In  these  cases,  examples  must  fill  in  the 
details  left  out  of  the  instruction.  Because  of  the  way 
that  ANT  extracts  rules  from  example  parses,  this  is 
a  natural  by-product  of  ANT’s  learning  process. 

4  Learning  from  only  instructions  or 
only  examples 

It seems intuitive that learning should be easier if the
learner  is  provided  with  both  instructions  and  exam¬ 
ples  than  with  either  alone.  ANT’s  learning  process 
suggests  possible  reasons  for  this.  In  learning  from 
instructions  without  examples,  ANT  would  have  diffi¬ 
culties  with  the  following  tasks: 

•  identifying  relevant  previous  knowledge 

•  relating  terms  to  constituents 
• inferring details not mentioned in the instruction

As  we  saw  in  the  relative  clause  example,  with¬ 
out  examples  to  guide  the  search  for  relevant  English 
grammar  rules,  finding  these  rules  could  potentially  be 
very  costly.  In  the  worst  case,  finding  relevant  rules 
could  mean  inspecting  the  entire  grammar.  Determin¬ 
ing  the  correct  relevant  rules  may  not  even  be  possible 
if there are several alternatives from which to choose.

Another  difficulty  is  that  ANT’s  internal  representa¬ 
tion  of  grammatical  categories  may  not  exactly  match 
the  terminology  used  in  the  instructions.  For  example: 

In  German  statements,  the  verb  must  be  the 
second  constituent. 

It  is  not  clear  from  this  instruction  what  constitutes 
a  “constituent.”  Without  extensive  further  explana¬ 
tion,  there  are  several  possible  interpretations  of  this 
instruction. For example, it is not clear from this instruction
what the German equivalent of “In the park
the man slept” would be. There are several possibilities
(using English words instead of German):

In  slept  the  park  the  man. 

In  slept  the  man  the  park. 

In  the  park  slept  the  man. 

The  man  slept  in  the  park. 

The  third  choice  above  is  the  correct  one,  but  deter¬ 
mining  this  requires  knowing  what  constitutes  a  con¬ 
stituent.  Does  a  preposition  alone  qualify,  or  an  entire 
prepositional  phrase?  Also,  what  is  the  ordering  of  the 
constituents  after  the  verb?  Without  further  explana¬ 
tion  as  to  what  the  term  “constituent”  refers  to,  ANT 
cannot  deduce  what  the  correct  German  rule  is. 

Finally,  even  if  terminology  is  not  a  problem,  often 
instructions  simply  do  not  contain  all  the  necessary 
information  to  infer  the  correct  German  grammar  rule. 
For  example: 

In German, the verb “haben” used with the
adverb  “gern”  means  “to  like.” 

Many  German  textbooks  leave  out  the  details  of  this 
rule,  such  as  where  the  object  of  “haben”  should  ap¬ 
pear  in  the  sentence  (after  the  verb  but  before  “gern”). 
This is more restrictive than the general rules for placement
of adverbs in German.

There  is  an  obvious  solution  to  the  problem  that  in¬ 
structions  leave  out  details:  one  might  simply  insist 
that  instructions  be  written  so  as  to  include  details 
that  are  often  left  out.  However,  in  designing  a  learn¬ 
ing  system,  it  is  desirable  for  that  system  to  be  in¬ 
structed  in  a  way  that  is  natural  for  people.  Judging 
from  language  textbooks,  instructors  find  it  most  nat¬ 
ural  to  leave  out  some  information  from  instructions, 
relying  on  examples  to  enable  the  reader  to  infer  the 
details  of  the  rule. 

For  ANT,  learning  from  examples  without  instruc¬ 
tions  would  also  be  problematic.  In  particular,  the 
following  tasks  would  be  more  difficult: 


•  parsing  examples 

•  identifying  relevant  features 

•  generalizing  new  rules 

As we stated before, the instruction tells ANT in
what ways the examples will deviate from English
grammar. Without this information, it is often possible
to arrive at the wrong parse of an example. Without
the correct structural analysis, ANT cannot induce
the  correct  German  grammar  rules. 

Even  if  the  correct  structural  analysis  is  arrived  at 
for  an  example,  there  may  be  other  problems.  The 
number  of  possible  features  on  which  to  base  an  hy¬ 
pothesis  is  very  large.  Without  instructions,  ANT  can¬ 
not  know  whether  the  point  of  the  example  is  to  show 
case  agreement,  a  word  order  modification,  a  novel 
German  construction  for  which  there  is  no  equivalent 
English  construction,  and  so  on.  Thus,  we  would  ex¬ 
pect  the  learning  process  to  be  much  slower  if  instruc¬ 
tions  were  not  included  in  the  input. 

Finally,  even  if  the  correct  features  are  identified, 
there  is  the  issue  of  how  general  the  new  rule  is.  Ex¬ 
amples  alone  cannot  always  convey  the  correct  con¬ 
ditions  for  the  application  of  a  rule.  For  example,  in 
German  there  is  a  rule  that  the  verb  must  be  the  sec¬ 
ond  constituent  in  a  statement.  Suppose  the  system 
processes  some  examples  of  statements.  It  is  plausible 
that  the  system  (or  a  person)  could  realize  the  gen¬ 
eralization  to  be  made  is  that  the  verb  is  second,  but 
not  know  whether  this  applies  to  all  verbs  (like  those 
in  clauses),  to  only  the  main  verb  in  the  sentence,  to 
verbs  in  questions,  and  so  on. 

5  An  empirical  comparison 

In  order  to  demonstrate  empirically  that  learning  from 
instructions  and  examples  is  easier  than  learning  from 
instructions  only  or  examples  only,  we  implemented 
versions  of  ANT  which  learn  from  instructions  only 
and from examples only. We then compared the performance
of these versions on a small subset of the rules
that  original  ANT  learns. 

5.1 Instructions-only ANT

The  comparison  between  original  ANT  and  the  version 
which  learned  from  instructions  only  focused  on  the 
amount  of  searching  through  the  grammar  that  was 
necessary  in  instructions-only  ANT.  This  search  was 
not  necessary  at  all  in  original  ANT,  since  processing 
examples yielded the relevant English rules as a by-product
of parsing.

Given  knowledge  of  how  to  find  an  embedded  con¬ 
stituent  by  searching  the  grammar,  instructions-only 
ANT  was  given  five  reordering  instructions  to  learn. 
These  instructions  were  the  following: 

(1) In a statement, the verb is in second position.

(2) When a dependent clause begins a sentence,
the verb of the independent clause immediately
follows the dependent clause.

(3) In a relative clause the verb comes at the
end.

(4) In a dependent clause the verb comes at
the end.

(5) When a modal auxiliary occurs in a statement,
the infinitive comes at the end.

Instruction   Rules searched   Items searched
1             5                7
2             23               15
3             2                3
4             5                8
5             10               12

Figure 2: Performance of Instructions-only ANT

The  instructions  were  randomly  chosen  from  the  set 
of  reordering  instructions  that  original  ANT  learns. 
Instructions-only  ANT’s  task  was  to  find  the  relevant 
English  rule  which  would  be  affected  by  the  instruc¬ 
tion,  and  then  change  that  rule. 

ANT  stopped  its  search  as  soon  as  it  found  any  in¬ 
stance  of  the  constituent  it  was  looking  for.  For  exam¬ 
ple,  in  the  relative  clause  rule,  ANT  stopped  as  soon 
as  it  found  any  conjugated  verb  (CONJ-V)  within  a 
relative  clause.  This  sometimes  led  to  the  incorrect 
selection  of  English  rules  to  be  modified. 

Figure  2  shows  the  results  for  the  five  instructions. 
The  number  of  rules  examined  is  given,  as  well  as  the 
number  of  constituents,  arrived  at  by  summing  the 
number of distinct constituents on the right-hand side of
each context-free rule that was searched. The average
number of rules searched is 9, about 20% of the size of
the  entire  grammar. 
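
The average quoted above can be recomputed from the rule counts in Figure 2. The snippet below is only an arithmetic check; the variable names are ours, and the implied grammar size is an inference from the stated 20%, not a number reported in the text.

```python
# Rules searched per instruction, copied from Figure 2.
rules_searched = {1: 5, 2: 23, 3: 2, 4: 5, 5: 10}

average = sum(rules_searched.values()) / len(rules_searched)
print(average)  # 9.0

# If 9 rules is about 20% of the grammar, the grammar would hold roughly
# 45 rules. This size is inferred here, not stated in the text.
implied_grammar_size = average / 0.20
```

Note how uneven the counts are: instruction 2 alone accounts for more than half of all rules searched.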

Recall  that  the  search  terminated  as  soon  as  any  in¬ 
stance  of  the  desired  constituent  was  found.  Unfortu¬ 
nately,  the  wrong  instance  was  found  first  for  3  of  the 
5  instructions.  This  suggests  that  the  search  should 
continue  even  after  finding  the  desired  constituent,  to 
see if other instances of the constituent can be found.
The result would be a much more extensive search.
In addition, it is not clear how instructions-only ANT
would be able to decide which instance of the constituent
should be modified.

5.2 Examples-only ANT

In  designing  the  examples-only  version  of  ANT,  many 
assumptions  had  to  be  made  along  the  way,  as  there 
were  no  examples-only  second  language  acquisition 
systems which we could directly compare to ANT.*

*One  possible  candidate  for  comparison  was  RINA 
(Zernik  and  Dyer,  1987),  but  this  system  learned  mainly 
idiomatic expressions rather than basic grammatical constructions,
and received more feedback from the teacher


We  tried  to  be  as  generous  as  possible  to  the  examples- 
only  approach  with  these  assumptions,  greatly  simpli¬ 
fying  the  task  at  times  so  as  to  be  sure  not  to  bias  the 
comparison  in  favor  of  the  instructions-and-examples 
approach. 

One  assumption  we  made  was  that  transfer  from  the 
native  language  would  be  useful  in  the  examples-only 
approach.  Intuitively,  it  seems  that  learning  should 
proceed  faster  if  the  system  started  with  its  knowledge 
of  English  rather  than  from  scratch.  Also,  there  is 
much  evidence  that  people  at  least  selectively  transfer 
native  language  information  to  the  learning  of  a  second 
language (Selinker, 1969; Kellerman, 1987; Anderson,
1983; Gass, 1980).

Another  simplification  we  made  was  that  only  rules 
having  to  do  with  deviations  from  English  word  order 
would  be  learned.  This  meant  the  system  had  many 
fewer  possible  features  to  consider  in  the  input  exam¬ 
ples  when  hypothesizing  new  rules;  for  example,  it  did 
not  need  to  be  concerned  with  agreement,  case,  etc. 
Next,  part  of  the  input  to  the  examples-only  version 
of  the  system  was  the  parse  of  each  of  the  examples. 
This  greatly  simplified  the  task  of  hypothesizing  new 
rules,  since  the  system  did  not  need  to  guess  at  which 
constituents  should  be  grouped  together.  The  learning 
task,  then,  was  to  infer  how  general  the  rule  should  be 
which  was  illustrated  by  the  set  of  examples  and  their 
parses.  Finally,  an  assumption  was  made  about  the 
conditions  for  a  new  rule:  it  was  assumed  that  the 
presence  of  a  constituent,  rather  than  its  position  or 
some other feature, would be the only condition under
which a new rule was required. For example, a rule
such  as  “if  the  direct  object  is  a  personal  pronoun,  then 
it  precedes  the  indirect  object”  could  not  be  learned 
under  this  assumption,  since  its  condition  does  not  de¬ 
pend  simply  on  the  presence  of  a  direct  object,  but  on 
additional  features  of  this  constituent. 

To  learn  from  examples  only,  ANT  performed  the 
following  analyses  on  the  input  it  was  given: 

1.  find  all  differences  between  the  form  of  the  rules 
used  in  the  parse  and  the  original  English  rules 

2.  find  all  descriptions  possible  of  the  ordering  that 
occurs  in  the  parse  tree 

3.  find  all  possible  conditions,  derivable  from  the 
current  example,  under  which  the  rule  to  be 
learned  might  be  restricted,  then  find  the  intersec¬ 
tion  of  all  the  conditions  for  all  examples  analyzed 
thus  far 

4. enumerate the descriptions which are in the intersection
of all differences noted thus far for the set
of  examples  and  which  are  in  the  intersection  of 
all  descriptions  found  in  (2). 

These  four  steps  are  described  below. 


than the traditional examples-only paradigm of many machine
learning programs.
Finding differences

To  process  the  examples,  ANT  analyzes  the  parse 
trees  for  the  examples  (which  it  is  given  as  part  of  its 
input),  one  at  a  time.  The  system  looks  at  each  node 
in  the  tree,  in  search  of  a  grammar  rule  whose  right- 
hand-side  enumerates  its  children  in  the  correct  order. 
If  such  a  rule  cannot  be  found,  then  ANT  looks  for  a 
rule  which  has  the  same  constituents  but  not  in  the 
correct  order.  For  example,  let  us  assume  the  English 
rules  for  STMT  are  the  following: 

(17)  STMT  ->  PP  NP  VG  NP 

(18) STMT -> NP VG NP

Now  say  that  an  example  is  presented,  such  as: 

Im Park spielen die Kinder Fußball.

(In  the  park  play  the  children  soccer) 

Since the construction of this example is PP VG
NP NP, ANT first looks for a rule which matches this
construction, STMT -> PP VG NP NP. Since no such
rule can be found, ANT picks STMT -> PP NP VG
NP as the closest English rule. It then enumerates the
differences  between  the  rule  and  the  ordering  in  the 
parse  tree. 
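
The lookup just described can be sketched in a few lines. This is a minimal illustration under our own assumptions, not ANT's implementation: the grammar is a toy dictionary holding only rules 17 and 18, and the names `closest_rule` and `ordering_differences` are hypothetical.

```python
from collections import Counter

# Hypothetical toy grammar: category -> list of right-hand sides (rules 17, 18).
ENGLISH_RULES = {
    "STMT": [["PP", "NP", "VG", "NP"], ["NP", "VG", "NP"]],
}

def closest_rule(category, children, grammar=ENGLISH_RULES):
    """Return (rule, exact) for the children of one parse-tree node.

    An exact match is preferred; otherwise the first rule with the same
    constituents in a different order is returned. (None, False) if neither.
    """
    candidates = grammar.get(category, [])
    for rhs in candidates:
        if rhs == children:
            return rhs, True
    for rhs in candidates:
        if Counter(rhs) == Counter(children):
            return rhs, False
    return None, False

def ordering_differences(rule, children):
    """List (position, expected, found) wherever the two orders disagree."""
    return [(i, r, c)
            for i, (r, c) in enumerate(zip(rule, children), start=1)
            if r != c]

german_children = ["PP", "VG", "NP", "NP"]  # Im Park spielen die Kinder Fußball.
rule, exact = closest_rule("STMT", german_children)
diffs = ordering_differences(rule, german_children)
print(rule, exact)  # ['PP', 'NP', 'VG', 'NP'] False
print(diffs)        # [(2, 'NP', 'VG'), (3, 'VG', 'NP')]
```

Comparing multisets of constituents is one simple way to express "same constituents but not in the correct order"; the text does not say how ANT ranks several such candidates.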

Finding the list of features

Next  ANT  needs  to  find  all  of  the  features,  in  terms 
of  ordering,  that  exist  in  the  tree,  so  that  the  common 
features across all examples are remembered. ANT
should extract not only the features in its examples
that show differences from the original rules, but also
the features common to all examples,
saving  those  as  candidates  for  the  generalization  it  is 
trying  to  make.  For  example,  when  ANT  works  with 
the  instruction  about  the  verb  coming  second  in  the 
sentence,  it  may  first  receive  an  example  that  uses  the 
rule  20  below: 

(19) STMT -> PP NP VG NP

(20) STMT -> NP VG NP

In this case, ANT will not be able to detect any constituent
ordering changes, since the original rule has
the same constituent ordering as the example. However,
all features of the example are potential candidates
for generalization, depending on differences between
English rules and subsequent examples in the
example  set.  Say  the  next  example  in  the  set  utilized 
rule  19  above.  This  time  the  ordering  would  be  differ¬ 
ent,  since  the  VG  in  the  example  would  be  second.  One 
of  ANT’s  hypotheses  for  generalization  then  would  be 
that  the  VG  is  second  in  STMT.  In  order  to  verify 
that  this  feature  is  significant,  and  hence  still  a  candi¬ 
date  for  the  generalization  it  should  make,  ANT  must 
know  if  this  particular  feature  occurred  in  all  previous 
examples. 
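
The bookkeeping in this step amounts to intersecting feature sets across examples. The sketch below assumes a deliberately small feature vocabulary (absolute position and "last"), which is our simplification; the text does not enumerate ANT's actual ordering descriptions.

```python
def ordering_features(category, children):
    """Describe where each constituent sits among its siblings."""
    features = set()
    last = len(children)
    for pos, child in enumerate(children, start=1):
        features.add((category, child, "position", pos))
        if pos == last:
            features.add((category, child, "position", "last"))
    return features

def common_features(examples):
    """Intersect the ordering features of every example seen so far."""
    common = None
    for category, children in examples:
        feats = ordering_features(category, children)
        common = feats if common is None else common & feats
    return common or set()

examples = [
    ("STMT", ["NP", "VG", "NP"]),        # same order as English rule 20
    ("STMT", ["PP", "VG", "NP", "NP"]),  # German order for the rule 19 case
]
shared = common_features(examples)
print(("STMT", "VG", "position", 2) in shared)  # True
```

With only these two examples, "VG is second" survives, but so do "NP is third" and "NP is last", which illustrates why further examples are needed to narrow the candidates.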

Finding  conditions 

The  system  must  be  capable  of  learning  the  condi¬ 
tions  under  which  the  changes  that  it  learns  apply.  The 
range of possible conditions has been simplified by our
assumption that the only conditions that can apply
are those that involve the presence of a constituent in
a  particular  category.  Thus  the  system  would  be  able 
to  determine  the  condition  in  a  rule  to  be  learned  like 
“When  a  modal  auxiliary  occurs  in  a  sentence,  the  in¬ 
finitive  comes  at  the  end,”  but  it  would  not  be  able 
to  detect  the  condition  in  “When  a  dependent  clause 
begins  a  sentence,  the  verb  of  the  independent  clause 
follows  the  dependent  clause.” 

Finding features common to all examples

After  it  is  done  processing  an  example,  ANT  com¬ 
bines  the  information  from  that  example  with  the  cu¬ 
mulative  data  gathered  from  examples  thus  far.  It 
finds  the  intersection  of  differences  found  in  the  gram¬ 
mar  and  computes  the  union  of  that  set  with  the  in¬ 
tersection  of  what  is  found  in  all  examples.  In  other 
words,  ANT  arrives  at  hypotheses  for  what  the  gen¬ 
eralization  of  the  rule  might  be  both  from  what  is  in 
common  with  all  previous  examples  and  from  what  al¬ 
terations  of  the  original  grammar  were  evident  in  the 
examples.  The  goal  is  to  derive  a  minimal  set  of  hy¬ 
potheses,  usually  one,  for  what  the  general  lesson  is, 
where  the  general  lesson  is  basically  equivalent  to  the 
instruction  in  the  instructions  and  examples  version. 
Along  with  this  generalization,  the  system  should  de¬ 
tect  any  relevant  conditions. 

5.3  Results 

The  examples-only  algorithm  was  run  on  five  sets  of 
examples,  and  its  performance  was  compared  to  the 
performance  of  the  examples-and-instructions  version 
of  ANT  on  these  rules.  In  the  examples-only  imple¬ 
mentation,  the  correct  rule  was  learned  for  four  of  the 
five  sets,  while  one  set  of  examples  resulted  in  an  in¬ 
correct  generalization.  The  reason  for  this  incorrect 
generalization  is  discussed  below.  Moreover,  ANT  was 
much slower at inducing the new grammar rule when
using only examples. The examples-only
version required an average of 4.0 examples per rule
learned,  while  an  average  of  only  1.6  examples  were 
required  in  the  instructions-and-examples  approach. 
Figure  3  compares  the  number  of  examples  required 
to  learn  each  rule  for  the  two  algorithms. 

The rules used in the comparison were:

(1)  The  verb  is  the  second  constituent  in  a 
sentence. 

(2)  The  verb  comes  at  the  end  of  the  relative 
clause. 

(3)  The  verb  comes  at  the  end  of  a  dependent 
clause. 

(4)  When  a  dependent  clause  begins  a  sen¬ 
tence,  the  verb  of  the  sentence  follows  the 
dependent clause.

(5)  When  a  modal  auxiliary  occurs  in  a  sen¬ 
tence,  the  infinitive  comes  at  the  end. 

Figure 3: Comparison of Performance of Original and
Examples-only ANT

Rule number five was not correctly learned from examples
only. In fact, given the assumptions that we
made, no set of positive examples could eventually
cause  the  system  to  derive  the  correct  condition.  This 
is  because  ANT  cannot  rule  out  from  positive  examples 
the  alternative  hypothesis  that  if  an  infinitive  occurs 
in  a  sentence,  then  a  modal  auxiliary  is  also  required. 
This  alternate  hypothesis  is  consistent  with  the  data, 
and  cannot  be  ruled  out  by  positive  examples  which 
conform  to  rule  5. 
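
The ambiguity can be made concrete with a small sketch. The flat constituent lists and the category names MODAL and INF below are hypothetical stand-ins chosen for illustration; they are not ANT's representation.

```python
# Hypothetical positive examples of rule 5, as flat constituent lists.
positive_examples = [
    ["NP", "MODAL", "NP", "INF"],
    ["PP", "MODAL", "NP", "NP", "INF"],
    ["NP", "MODAL", "INF"],
]

def modal_implies_final_inf(sentence):
    """Rule 5: if a modal occurs, the infinitive comes at the end."""
    return "MODAL" not in sentence or sentence[-1] == "INF"

def inf_implies_modal(sentence):
    """Alternative: if an infinitive occurs, a modal is also required."""
    return "INF" not in sentence or "MODAL" in sentence

# Every positive example of rule 5 contains both constituents, so both
# hypotheses are satisfied by all of them.
both_survive = all(
    modal_implies_final_inf(s) and inf_implies_modal(s)
    for s in positive_examples
)
print(both_survive)  # True

# Only a sentence of a different kind (here, an infinitive with no modal,
# a made-up case) would tell the two hypotheses apart.
separating = ["NP", "VG", "INF"]
print(modal_implies_final_inf(separating), inf_implies_modal(separating))
# True False
```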

6  Conclusion 

We  have  presented  a  theory  of  learning  which  utilizes 
both  instructions  and  examples  to  learn  grammar  rules 
of  a  second  language.  This  theory  suggests  several 
functional  roles  for  both  instructions  and  examples  in 
learning.  Performance  of  our  system  declines  substan¬ 
tially  if  it  is  required  to  learn  from  only  instructions  or 
only  examples,  both  in  terms  of  efficiency  and  correct¬ 
ness  of  learning.  In  learning  from  only  instructions, 
efficiency  declines  in  terms  of  the  search  required  to 
determine  what  previous  knowledge  is  relevant  to  the 
new information. In learning from only examples, efficiency
is measured in terms of the number of examples
required  to  learn  a  rule. 

The discrepancy in learning efficiency would potentially
have been much greater if we had not made
generous  assumptions  about  our  instructions-only  and 
examples-only  approaches.  In  both  approaches,  we  as¬ 
sumed  that  all  the  rules  learned  would  involve  word  or¬ 
der  changes,  as  this  vastly  simplified  the  learning  task. 
In  the  instructions-only  approach,  we  allowed  our  sys¬ 
tem  to  terminate  its  search  once  any  instance  of  the 
desired  constituent  was  found.  This  often  led  to  incor¬ 
rect  selection  of  English  rules.  In  the  examples-only 
approach,  we  assumed  that  all  the  rules  would  depend 
only  on  the  presence  or  absence  of  a  constituent  in 
the sentence. In reality, many rules specify other aspects
of grammar, or depend on other features, such as the
position of a constituent or case. Relaxing
these assumptions would mean that the space of possible
hypotheses to consider in deriving a rule would be
vastly larger in either approach. This, in turn, would
increase  the  amount  of  grammar  search  or  the  num¬ 
ber  of  examples  required  to  derive  a  new  rule.  In  the 
examples-and-instructions  version  of  ANT,  however, 
these  restrictions  were  not  imposed.  Thus,  the  effi¬ 
ciency  comparison  presented  here  is  most  likely  biased 
against  the  examples-and-instructions  approach. 
Abstract

A (string) pattern is a non-null string over
an alphabet and a set of variables. A pattern
language is the set of all strings obtained
by substituting non-null constant strings for
variables in a pattern. We investigate learning
pattern languages in three new directions.
First we prove that to decide whether
there is a string pattern consistent with given
positive and negative samples is NP-hard.
Thus pattern languages are not polynomial-time
learnable under Valiant’s probabilistic learning
model unless RP = NP.

Then we discuss extensions of examples and
string patterns: incomplete examples and
tree patterns. We prove that to decide
whether there is a common k-variable string
pattern for given incomplete positive examples
is NP-complete for any fixed integer k ≥ 2.
We give polynomial-time algorithms to
find common k-variable tree patterns for
non-associative, non-commutative constant trees
for any fixed integer k ≥ 1. We also prove
that to decide whether there is a common
k-variable tree pattern for associative, commutative
constant trees is NP-complete for any
fixed integer k ≥ 2.

1  Introduction 

A (string) pattern p is a non-null string over a constant
alphabet Σ and a set X of variables. The
pattern language L(p) defined by a pattern p is the
set of all strings over Σ which can be obtained by
substituting non-null constant strings for variables
in p. For example, L(10x₁x₁01) = {100001, 101101,
10000001, 10010101, 10101001, 10111101, ...}. In
this paper we are concerned with learning patterns
and certain generalized patterns from examples.
Our model of learning is based on the exact-information,
worst-case complexity model, in contrast
to Valiant’s probabilistic-information,
probabilistic-complexity, distribution-free model [Va84]. Some
results on learning pattern languages in Valiant’s model
can be derived as consequences of our results through
the work of Blumer et al. [BEHW89].
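
The definition of L(p) can be exercised with a small backtracking membership test. This is a sketch under our own conventions: a pattern is a list of tokens, and any token starting with "x" is treated as a variable. It is not an algorithm from the paper, and it takes exponential time in the worst case.

```python
def matches(pattern, s, binding=None):
    """Return True if the string s belongs to L(pattern).

    pattern is a list of tokens; tokens starting with "x" are variables
    (a convention of this sketch), all others are constant symbols.
    Each variable is replaced by the same non-empty string everywhere.
    """
    binding = binding or {}
    if not pattern:
        return s == ""
    head, rest = pattern[0], pattern[1:]
    if not head.startswith("x"):                  # constant symbol
        return s.startswith(head) and matches(rest, s[len(head):], binding)
    if head in binding:                           # variable already bound
        value = binding[head]
        return s.startswith(value) and matches(rest, s[len(value):], binding)
    for end in range(1, len(s) + 1):              # try every non-empty prefix
        if matches(rest, s[end:], {**binding, head: s[:end]}):
            return True
    return False

p = ["1", "0", "x1", "x1", "0", "1"]              # the pattern 10x₁x₁01
print(matches(p, "10010101"))  # True  (x1 = "01")
print(matches(p, "100101"))    # False (the two copies of x1 would differ)
print(matches(p, "1001"))      # False (x1 may not be empty)
```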

Learning of patterns was introduced by Angluin
[An80]. She gave a polynomial-time algorithm
(in fact, a nondeterministic log-space algorithm) for
finding the longest one-variable pattern from sample
strings (or, from positive examples only). Following
the principle of Occam’s razor [BEHW87], the longest
pattern will guarantee that the solution is fittest in
the sense that no other suitable pattern language is a
proper subset of it. For n > 2, the problem of finding
the longest two-variable pattern from positive examples
is left open. Ko and Hua [KH87] showed that a
straightforward generalization of Angluin’s algorithm
for the two-variable case does not work in polynomial
time unless P = NP. This suggests that the problem
may be NP-complete. For the case where the number
of variables is not fixed, even the simple membership
problem for pattern languages (i.e., given a pattern p
and a constant string s, determine whether s ∈ L(p))
is known to be NP-complete [An80].

We  attack  this  problem  in  three  new  directions. 
First,  we  consider  the  problem  of  learning  patterns 
from  both  positive  and  negative  examples.  More 
precisely,  the  new  problem  FCP  of  finding  common 
patterns  from  both  positive  and  negative  examples  is: 
given  two  sets  S  and  T  of  constant  strings,  decide 
whether  there  exists  a  pattern  p  consistent  with  S  and 
T, i.e., S ⊆ L(p) and T ∩ L(p) = ∅.

We explain this problem by a possible application.
Molecular biologists gather sequences of
nucleotides from a segment of DNA which
corresponds to some kind of disease. Some sequences
are from normal people and some are from patients.
Then they try to find (learn) a genetic pattern which
can explain this phenomenon. There are two interpretations
of the gathered data. The first one is that
a sequence of nucleotides is normal if it can be obtained
from a genetic pattern by substituting some
nucleotides for some specific positions; otherwise it is
abnormal. In this case positive examples are the
normal sequences and negative examples are the abnormal
sequences. For example, if the genetic pattern
AGx₁Cx₁AT is learned from normal sequences AGTCTAT
and AGGCGAT, and abnormal sequences AGGGAAT
and AGTCAAT, then AGACCAT is an abnormal
sequence of nucleotides which may cause a disease,
where {A, C, G, T} are the four types of nucleotides.
The second one is that an abnormal sequence of
nucleotides has some specific sub-sequences of nucleotides.
In this case, the positive examples are the
abnormal sequences and negative examples are the
normal sequences. A sequence is abnormal if it can
be obtained from the learned genetic pattern. This is
because the specific disease sub-sequences will appear as
sub-sequences of the learned genetic pattern. For example,
if the genetic pattern AGx₁ATx₂C is learned from
abnormal sequences AGCATTC and AGAATGC, and
normal sequences ACTTCAC and AGCAGGC, then
AGTATGC is an abnormal sequence. Our problem
corresponds to this learning procedure, where the nucleotides
correspond to the alphabet Σ. For the first interpretation
of data, the specific positions of the genetic pattern
correspond to the variables of our pattern. For the
second interpretation, the specific sub-sequences correspond
to constant sub-strings of our patterns. With
this tool, scientists can tell whether an unborn baby
has a risk of some kind of disease through a genetic
examination.
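
Both genetic examples can be checked mechanically against the definition of consistency (every positive example in L(p), no negative example in L(p)). The sketch below translates a pattern into a Python regular expression with named backreferences. The token convention ("x"-prefixed tokens are variables) and the two patterns, read here as AG x₁ C x₁ AT and AG x₁ AT x₂ C, are our assumptions; the second pattern needs two distinct variables because its positive examples fill the two positions with different nucleotides.

```python
import re

def pattern_to_regex(pattern):
    """Translate a token list into a regex with named backreferences.

    The first occurrence of a variable becomes a group matching a non-empty
    string; later occurrences become backreferences to that group.
    """
    seen, parts = set(), []
    for tok in pattern:
        if tok.startswith("x"):
            parts.append(f"(?P={tok})" if tok in seen else f"(?P<{tok}>.+)")
            seen.add(tok)
        else:
            parts.append(re.escape(tok))
    return re.compile("".join(parts))

def consistent(pattern, positives, negatives):
    """True if all positives and no negatives belong to L(pattern)."""
    rx = pattern_to_regex(pattern)
    return (all(rx.fullmatch(s) for s in positives)
            and not any(rx.fullmatch(s) for s in negatives))

first = ["A", "G", "x1", "C", "x1", "A", "T"]
print(consistent(first, ["AGTCTAT", "AGGCGAT"], ["AGGGAAT", "AGTCAAT"]))  # True
print(pattern_to_regex(first).fullmatch("AGACCAT") is None)   # True: abnormal

second = ["A", "G", "x1", "A", "T", "x2", "C"]
print(consistent(second, ["AGCATTC", "AGAATGC"], ["ACTTCAC", "AGCAGGC"]))  # True
print(pattern_to_regex(second).fullmatch("AGTATGC") is not None)  # True: abnormal
```

This only verifies a given pattern; finding one for arbitrary S and T is the hard part that the problem FCP captures.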

We show that the problem FCP is NP-hard when the
number of variables in p is not fixed. An interesting
consequence of this result is, from the general result
of Blumer et al., that learning patterns in Valiant’s
probabilistic learning model cannot be done in polynomial
time unless RP = NP. Kearns and Pitt [KP89]
also considered learning pattern languages in this direction.
They gave a polynomial-time algorithm for
learning k-variable pattern languages, for any fixed
k ≥ 1, under arbitrary product distributions. However,
their algorithm outputs a set of simpler patterns
with k or fewer variables instead of one single k-variable
pattern. Thus, the precise complexity of learning
k-variable patterns from positive and negative examples
and outputting a single k-variable pattern is still left
open for any fixed k ≥ 1. This problem falls into the
complexity class NP since the membership problem of
k-variable patterns is in P for any fixed k ≥ 1. Recently,
Schapire [ScSO] also showed that pattern languages
are not learnable in Valiant’s learning model if
P/poly ≠ NP/poly. However, he allows empty-string
substitution of variables. This model is different from
ours, which prohibits empty-string substitution. It appears
that this slight difference does affect the complexity
of the learning problem.

The second approach is to learn patterns from incomplete
examples. This problem can also apply to
the above genetic pattern problems. It is not unusual
that experimental data are not perfect. They may miss
some parts of genetic sequences and thus we may have
incomplete samples.

We say a string s′ ∈ Σ⁺ is consistent with a string
s ∈ (Σ ∪ {?})⁺ (assuming ? ∉ Σ) if |s′| = |s| and for every
i, s′(i) = s(i) whenever s(i) ≠ ?, where s(i) denotes
the ith character in s. Let p be a pattern. A string
s ∈ (Σ ∪ {?})⁺ is an incomplete positive example for p if
there exists an s′ ∈ Σ⁺ such that s′ is consistent with
s and s′ ∈ L(p). The question here is to determine
whether incomplete examples make learning more difficult.
Our result supports an affirmative answer: for
any fixed k ≥ 2, the problem of finding the longest
k-variable pattern from incomplete positive examples
is NP-complete (while, as discussed above, the corresponding
problem using complete positive examples
is still not known to be NP-complete). However, the
1-variable case of the problem remains open (while
the 1-variable case of complete positive examples is
polynomial-time solvable [An80]).

The third approach is to generalize the string patterns to
two-dimensional tree patterns and to add the commutativity
property to the concept of patterns. The idea of tree patterns
comes from the area of term rewriting in theorem proving [BKN85]
[VR89]. A function f which corresponds to an internal node of a
tree is commutative if f(a, b) = f(b, a), where a and b are two
subtrees of node f. A function g is associative if
g(a, g(b, c)) = g(g(a, b), c). A tree pattern is a tree with
function symbols associated with internal nodes and constants or
variables on leaves. A tree pattern generates constant trees
(trees with no variables) by substituting constant trees for each
variable. The membership problem for tree patterns is to
determine, for a given tree pattern t and a constant tree s,
whether s can be derived from t by a substitution. This problem is
called term matching in the area of term rewriting. Efficient
algorithms have been developed for several variations of this
problem (e.g. [BKN85]).

A string pattern can be viewed as a depth-1 tree pattern having
the associativity property. Thus, removing the associativity
property from a tree pattern would potentially simplify the
learning problem, and adding the commutativity property would
make it more difficult. Our results support this intuition. We
show that learning a tree pattern without the associativity and
commutativity properties can be done in polynomial time no matter
whether the number of variables is fixed or not. On the other
hand, we show that if a tree pattern is allowed to have the
associativity and commutativity properties, then the learning
problem is NP-complete for the two-variable case.

The above NP-completeness results (incomplete examples and tree
patterns) seem difficult to strengthen to the one-variable case.
Intuitively, one-variable patterns do not have structures complex
enough for reduction proofs of NP-hardness. The complexity of the
one-variable case is, intuitively, closer to that of the original
two-variable string pattern learning problem of complete positive
examples.

2  Preliminaries

Let Σ be a finite alphabet and X, disjoint from Σ, be a countable
set. Then Σ^* denotes the set of all finite (constant) strings over
Σ and Σ^+ = Σ^* − {empty string}. Elements in Σ are called
constants. Elements in X are called variables. Let s_1 s_2 denote
the concatenation of two strings s_1 and s_2, and s^m the
concatenation of m s's. Let ||S|| denote the number of elements in
set S.

A (string) pattern is a non-null string over Σ ∪ X. Let |p|
denote the length of pattern p; for example, |x0x10yy| = 7. A
pattern p is called k-variable if there are no more than k
distinct variables occurring in p. A pattern p is called
k-occurrence if each variable in p occurs at most k times. Assume
that p is an n-variable pattern with variables x_1, ..., x_n. Then
p[s_1/x_1, ..., s_n/x_n] denotes the string obtained from p by
substituting s_i for each occurrence of x_i in p, i = 1, 2, ..., n.
The language L(p) is defined to be

L(p) = { p[s_1/x_1, s_2/x_2, ..., s_n/x_n] : s_i ∈ Σ^+, 1 ≤ i ≤ n }.
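The definitions above can be made concrete with a small sketch. The representation is an assumption made for illustration only: a pattern is a Python list of one-character constants and variable names, and the helper names `substitute` and `in_language` are hypothetical. The membership test is a plain backtracking search, so it may take exponential time in general, in line with the NP-completeness of membership stated in Proposition 2.2.

```python
def substitute(pattern, subs):
    """Return p[s_1/x_1, ..., s_n/x_n]; symbols not in `subs` are constants."""
    return "".join(subs.get(sym, sym) for sym in pattern)


def in_language(pattern, s, variables):
    """Decide s in L(pattern) by backtracking over non-empty substitutions."""
    def go(i, pos, binding):
        if i == len(pattern):
            return pos == len(s)
        sym = pattern[i]
        if sym not in variables:
            # A constant must appear literally at this position.
            return s.startswith(sym, pos) and go(i + 1, pos + len(sym), binding)
        if sym in binding:
            # A repeated variable must reuse its earlier substitution.
            val = binding[sym]
            return s.startswith(val, pos) and go(i + 1, pos + len(val), binding)
        # First occurrence: try every non-empty substring starting here.
        for end in range(pos + 1, len(s) + 1):
            binding[sym] = s[pos:end]
            if go(i + 1, end, binding):
                return True
        binding.pop(sym, None)
        return False

    return go(0, 0, {})


p = ["x", "0", "x", "1", "0", "y", "y"]          # the pattern x0x10yy
w = substitute(p, {"x": "1", "y": "01"})
print(w, in_language(p, w, {"x", "y"}))
```

Because empty substitutions are excluded, any string shorter than the pattern is rejected immediately.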

Note that if s ∈ L(p), then |s| ≥ |p|. Since we can check whether
two patterns are the same up to variable naming, in the rest of
this paper we neglect the naming difference of patterns.

The following two propositions are from [An80]. The first one
states some properties of pattern languages. The second one, as
we said before, shows that the membership problem of pattern
languages is NP-complete.

Proposition 2.1 The class of pattern languages is incomparable
with the class of regular languages and with the class of
context-free languages. The class of pattern languages is not
closed under any of these operations: union, complementation,
intersection, Kleene plus, homomorphism, or inverse homomorphism.
It is closed under concatenation and reversal.

Proposition 2.2 The membership problem for pattern languages is
NP-complete, i.e., the following problem is NP-complete: given a
pattern p and a string s ∈ Σ^+, decide whether s ∈ L(p).

3  Learning  Patterns  from  Positive 
and  Negative  Examples 

3.1  Positive  Examples  Only 

Angluin [An80] gave a polynomial-time algorithm to find the
longest one-variable pattern for a set of strings (positive
examples), while the complexity of the two-variable case is left
open (cf. [KH87]). For the problem of deciding whether there
exists a pattern (not necessarily longest) for positive examples
only, we should have a parameter of pattern length, otherwise the
pattern x would be a suitable solution for most cases. Without
restriction on the number of variables of target patterns, we can
easily find, in deterministic log space, a common pattern for a
set of strings with pattern length at least an input parameter r,
because we only need to check that the shortest string of the set
is of length r' ≥ r and use x_1 x_2 ... x_{r'} as the solution.
However, if we also use the number of variables as a parameter,
the problem becomes NP-complete.
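The unrestricted case described above is simple enough to write down directly. The sketch below is illustrative only; the function name `all_variable_pattern` and the variable names x1, x2, ... are assumptions, not notation from the paper.

```python
def all_variable_pattern(strings, r):
    """Return x_1 ... x_{r'} with r' = shortest length, or None if r' < r.

    Every string of length >= r' lies in L(x_1 ... x_{r'}), because the
    r' distinct variables can each take a non-empty piece of the string.
    """
    shortest = min(len(s) for s in strings)
    if shortest < r:
        return None
    return [f"x{i}" for i in range(1, shortest + 1)]


print(all_variable_pattern(["0101", "111", "00000"], 2))
```

Only the length of the shortest example is inspected, which is why the task needs so little space.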

Theorem 3.1 The following problem is NP-complete: given a finite
set S ⊆ Σ^+ and two integers n and t, determine whether there is an
n-variable pattern p of length ≥ t such that S ⊆ L(p).

This is proved by a simple reduction from the membership problem
without bounds on variables. We omit it here.

3.2  Positive  and  Negative  Examples 

In this section, we consider the problem of finding patterns (not
necessarily longest) from positive and negative examples. The
main result of this section is that without bounds on the pattern
length and the number of variables, we can prove that the
question of determining whether there is a pattern consistent
with given positive and negative examples is NP-hard.

Theorem 3.2 The following problem is NP-hard: given two finite
sets S and T of strings, determine whether there is a pattern p
which is consistent with S and T, in the sense that S ⊆ L(p) and
T ∩ L(p) = ∅.

Proof. We reduce the 3SAT problem [GJ79] to this problem. Let
U = {u_1, u_2, ..., u_n} be the set of variables and
C = {c_1, c_2, ..., c_m} be an arbitrary instance of 3SAT. We need
to construct two sets S and T such that C is satisfiable if and
only if there is a pattern p consistent with S and T. We first
consider the restrictive case that a consistent pattern must have
length ≥ 3n.

Construct the following positive examples s_i's and s'_i's, and
negative examples t_j's, 0 ≤ i ≤ n and 1 ≤ j ≤ n:

s_0 = r_1 r_2 ... r_n,
where r_k = 111, 1 ≤ k ≤ n;

s'_0 = r_1 r_2 ... r_n,
where r_k = 000, 1 ≤ k ≤ n;

s_i = r_1 r_2 ... r_n,
where r_i = 000, r_k = 111, 1 ≤ k ≤ n and k ≠ i;

s'_i = r_1 r_2 ... r_n,
where r_i = 1100, r_k = 111, 1 ≤ k ≤ n and k ≠ i;

t_j = r_1 r_2 ... r_n,
where r_j = 101, r_k = 111, 1 ≤ k ≤ n and k ≠ j.

Let S = ∪_{i=0}^{n} {s_i, s'_i} and T'' = ∪_{j=1}^{n} {t_j}. Assume
p = p_1 p_2 ... p_n, |p_i| = 3, 1 ≤ i ≤ n, is a pattern of length 3n
which is consistent with S and T''. From positive examples s_0 and
s_i we know that p_i contains 3 variables (no constants) and
variables in p_i do not occur in the other p_j's if i ≠ j. From the
negative example t_i, the middle variable of p_i is equal to the
first or the last one of p_i. By s'_i, p_i needs to match 1100, 100,
or 110 to obtain s'_i, so p_i cannot be x_i x_i x_i. Thus
p_i = x_i x_i y_i or x_i y_i y_i.

For each clause c_i = {l_{i_1}, l_{i_2}, l_{i_3}} in C, 1 ≤ i ≤ m,
we add t'_i as a negative example to guarantee that at least one
literal in c_i is assigned to true, i.e., not all of them can be
assigned to false. For 1 ≤ i ≤ m, we define

t'_i = r_1 r_2 ... r_n,

where

r_k = 100  if l_{i_j} = u_k, j = 1, 2, 3,
r_k = 110  if l_{i_j} = ¬u_k, j = 1, 2, 3,
r_k = 111  otherwise.

Let

T' = T'' ∪ (∪_{i=1}^{m} {t'_i}).

Now we want to show that a truth assignment τ satisfies C if and
only if there is a p consistent with our examples S and T'. For
the forward direction, we define p_i = x_i x_i y_i if τ(u_i) = true
and p_i = x_i y_i y_i if τ(u_i) = false for 1 ≤ i ≤ n. We can easily
verify that p = p_1 p_2 ... p_n is consistent with S and T'. In
particular, for each clause c_i, if u_k occurs in c_i and
τ(u_k) = true, then p_k = x_k x_k y_k, but t'_i = r_1 r_2 ... r_n with
r_k = 100, and so t'_i ∉ L(p); similarly, if ¬u_k occurs in c_i and
τ(u_k) = false, then p_k = x_k y_k y_k and r_k in t'_i is 110, and
t'_i ∉ L(p).

For the backward direction, if there exists a pattern
p = p_1 p_2 ... p_n of length 3n consistent with S and T', we define
τ(u_i) = true if p_i = x_i x_i y_i and false if p_i = x_i y_i y_i.
Then for each clause c_i = {l_{i_1}, l_{i_2}, l_{i_3}}, 1 ≤ i ≤ m,
the negative example t'_i guarantees that one of p_{i_j},
1 ≤ j ≤ 3, makes l_{i_j} = true.

Last, we need to prove that no patterns of length < 3n can be
consistent with S and T. We add the following strings to T:

{1^i : 1 ≤ i ≤ 3n − 1},

i.e., let

T = T' ∪ {1^i : 1 ≤ i ≤ 3n − 1}.

If some shorter pattern q is consistent with S and T, then all
its positions must be variables since s_0 and s'_0 are positive
examples. Since the length of q is less than 3n, 1^{|q|} can be
obtained as a positive example by substituting string 1 for every
variable. This contradicts our negative examples. □
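The construction in this proof can be checked mechanically on small instances. The sketch below builds S and T from a 3SAT instance and tests the pattern derived from each truth assignment. All names are hypothetical, and the clause encoding (a triple of nonzero integers, +k for u_k and -k for its negation) is an assumption made for illustration. The matcher is a naive backtracking search, adequate only for toy sizes.

```python
from itertools import product


def matches(pattern, s, variables):
    """Backtracking test for s in L(pattern), non-empty substitutions only."""
    def go(i, pos, binding):
        if i == len(pattern):
            return pos == len(s)
        sym = pattern[i]
        if sym not in variables:
            return s.startswith(sym, pos) and go(i + 1, pos + len(sym), binding)
        if sym in binding:
            val = binding[sym]
            return s.startswith(val, pos) and go(i + 1, pos + len(val), binding)
        for end in range(pos + 1, len(s) + 1):
            binding[sym] = s[pos:end]
            if go(i + 1, end, binding):
                return True
        binding.pop(sym, None)
        return False
    return go(0, 0, {})


def build_examples(n, clauses):
    """Return (S, T) as in the proof of Theorem 3.2."""
    def blocks(special=None):
        special = special or {}
        return "".join(special.get(k, "111") for k in range(1, n + 1))

    S = [blocks(), "000" * n]                      # s_0 and s'_0
    T = []
    for i in range(1, n + 1):
        S.append(blocks({i: "000"}))               # s_i
        S.append(blocks({i: "1100"}))              # s'_i
        T.append(blocks({i: "101"}))               # t_i
    for clause in clauses:                         # t'_i, one per clause
        T.append(blocks({abs(l): "100" if l > 0 else "110" for l in clause}))
    T.extend("1" * i for i in range(1, 3 * n))     # rules out shorter patterns
    return S, T


def pattern_for(assignment):
    """p_k = x_k x_k y_k if u_k is true, x_k y_k y_k otherwise."""
    p = []
    for k, value in enumerate(assignment, start=1):
        x, y = f"x{k}", f"y{k}"
        p += [x, x, y] if value else [x, y, y]
    return p


def consistent(p, S, T):
    variables = set(p)                             # these patterns have no constants
    return (all(matches(p, s, variables) for s in S)
            and not any(matches(p, t, variables) for t in T))


def satisfies(assignment, clauses):
    return all(any(assignment[abs(l) - 1] == (l > 0) for l in c) for c in clauses)


S, T = build_examples(3, [(1, -2, 3)])
for a in product([False, True], repeat=3):
    print(a, consistent(pattern_for(a), S, T), satisfies(a, [(1, -2, 3)]))
```

On such instances the two printed truth values agree for every assignment, which mirrors the forward direction of the proof; the backward direction (that no other pattern is consistent) is not checked by this sketch.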

Remark. Note that by Proposition 2.2, the above problem is not
known to be in NP. We showed that this problem is complete for
Σ_2^P with respect to log-space many-one reduction, where Σ_2^P is
the class of languages recognized by nondeterministic oracle
Turing machines in polynomial time relative to oracle sets in NP,
i.e., the second level of the polynomial time hierarchy. It will
be reported in a subsequent paper [TK90].

The above proof does not seem applicable to the case when the
number of variables is fixed. Thus, the problem of finding a
k-variable pattern, k ≥ 1, which is consistent with positive and
negative examples remains open.

Valiant [Va84] discussed a probabilistic learning model. In his
model, examples, when requested, are provided according to some
fixed but unknown probability distribution. Learning algorithms
are required to output an approximate concept using a polynomial
number of examples in polynomial time. Blumer et al. [BEHW89]
showed a general result that the polynomial learnability of a
concept under Valiant's learning model can be reduced to the
consistency problem of the concept. That is to say, the
approximate learning of Valiant is equivalent to the worst-case
learning if the VC-dimension of the domain grows polynomially
with respect to some size measure of the domain. In pattern
languages, if we take the length of a pattern as the natural size
measure, then its VC-dimension grows linearly by a simple
counting argument. As a consequence of this result, we have

Corollary 3.3 Pattern languages are not polynomial-time learnable
under Valiant's learning model unless RP = NP, where RP is the
class of problems which can be solved by randomized
polynomial-time algorithms.

4  Learning  Patterns  from  Incomplete 
Examples 

In this section we consider generalized examples, i.e.,
incomplete examples, for patterns. With this generalization, we
can prove that learning the longest two-variable pattern from
incomplete positive examples is NP-complete. However, the
1-variable case of the problem remains open.

Assume ? ∉ Σ. We say a string s' ∈ Σ^+ is consistent with a string
s ∈ (Σ ∪ {?})^+ if |s'| = |s| and for every i, s'(i) = s(i) if
s(i) ≠ ?, where s(i) denotes the ith character in s. A string
s ∈ (Σ ∪ {?})^+ is an incomplete positive example for a pattern p
if there exists an s' ∈ Σ^+ such that s' is consistent with s and
s' ∈ L(p). We first observe that the complexity of the membership
problem for incomplete examples and k-variable patterns remains
the same as that of complete examples for any fixed k ≥ 1.

Theorem 4.1 For any fixed k ≥ 1, there is a polynomial-time
algorithm for the following problem: given an incomplete example
s and a k-variable pattern p, decide whether s is an incomplete
positive example of p.

Proof: (Sketch) Let |s| = n. Assume the number of occurrences of
x_i in p is n_i, 1 ≤ i ≤ k, and the number of constants in p is
n_0. Let s_i be the possible substitution for x_i to get s as an
incomplete positive example. We have

n_0 + n_1|s_1| + n_2|s_2| + ... + n_k|s_k| = n.

We consider all possible |s_i|'s, 1 ≤ i ≤ k, which satisfy the
above equation. There are at most n^k of them. Each solution
determines the positions of the substitution strings in s, as
well as constant symbols in s. For each of them, we first check
whether the corresponding constant symbols of p and s are
consistent. If so, we then check, for each x_i, whether its
corresponding substrings in s are consistent, i.e., 0 and 1 do
not appear at the same corresponding position. For example, 0?100
and 0?110 are not consistent and ?01?0 and ?0?10 are consistent. □
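The proof sketch above translates almost line by line into code. The following is a minimal sketch under stated assumptions: patterns and examples are Python strings of single-character symbols, `variables` is the set of variable characters, and the function names are hypothetical. It enumerates all length vectors (at most n^k of them), then checks constants and the column-wise compatibility of the segments assigned to each variable.

```python
from itertools import product


def compatible(segments):
    """True if no position carries two different known symbols."""
    for column in zip(*segments):
        known = {ch for ch in column if ch != "?"}
        if len(known) > 1:
            return False
    return True


def is_incomplete_positive(pattern, variables, s):
    """Decide whether s (over the alphabet plus '?') is an incomplete
    positive example of `pattern`, following the proof of Theorem 4.1."""
    var_list = sorted({c for c in pattern if c in variables})
    n = len(s)
    for lengths in product(range(1, n + 1), repeat=len(var_list)):
        length_of = dict(zip(var_list, lengths))
        # The chosen lengths must satisfy n_0 + sum n_i |s_i| = n.
        if sum(length_of.get(c, 1) for c in pattern) != n:
            continue
        pos, ok = 0, True
        slots = {v: [] for v in var_list}
        for c in pattern:
            if c in variables:
                slots[c].append(s[pos:pos + length_of[c]])
                pos += length_of[c]
            else:
                if s[pos] not in ("?", c):     # constant clashes with s
                    ok = False
                    break
                pos += 1
        if ok and all(compatible(segs) for segs in slots.values()):
            return True
    return False


print(compatible(["0?100", "0?110"]), compatible(["?01?0", "?0?10"]))
print(is_incomplete_positive("x0x", {"x"}, "?001?"))
```

For a fixed number of variables the outer loop is polynomial in n, which is the point of the theorem; the sketch makes no attempt to prune length vectors early.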

Now  we  show  our  result  on  incomplete  positive  ex¬ 
amples. 

Theorem 4.2 The following problem is NP-complete: given a finite
set S of incomplete positive examples and an integer t, determine
whether there is a two-variable string pattern p of length ≥ t
such that each s ∈ S is an incomplete positive example for p.

Proof: This problem is in NP since a nondeterministic Turing
machine can guess a two-variable pattern p and, for each s ∈ S,
guess a consistent example s' and verify that s' and s are
consistent and s' ∈ L(p) in polynomial time.

We reduce the 3SAT problem to this problem. Let
U = {u_1, u_2, ..., u_n} be the set of variables and
C = {c_1, c_2, ..., c_m} be an arbitrary instance of 3SAT. Without
loss of generality, we assume each variable u_i, 1 ≤ i ≤ n, or its
negation ¬u_i, occurs at least once in some clause c_j, 1 ≤ j ≤ m.
First, we use two examples and the length bound to form a tableau
for a truth assignment for C. Define

si  =  0#$"+*#  VM 

11 

and 

71 

With length bound t = 3n + 8 and examples s_1 and s_2, any
2-variable pattern p must have the structure:

where x and y are variables and a_i ∈ {0, 1, x, y, #, $}.
Let 

S3  =  00^#$"+ V/ *  #  00^^ 

n+l  n  n+l 

then it can force every a_i, 1 ≤ i ≤ n, to be a constant. Thus, the
center part a_1 ... a_n of p may be viewed as a truth assignment
for C. That is, τ_p(u_i) = true if and only if a_i = 1.

For each clause c_i, we construct a string s'_i such that if an
assignment τ_p satisfies c_i then s'_i is an incomplete example for
p. Thus, {c_1, ..., c_m} is satisfiable if and only if there is a
pattern p for all s'_i, 1 ≤ i ≤ m. For each clause
c_i = {l_{i_1}, l_{i_2}, l_{i_3}}, l_{i_j} = u_{i_j} or ¬u_{i_j}, there
are seven truth assignments on variables u_{i_1}, u_{i_2}, u_{i_3}
which satisfy c_i (e.g., x_1 ∨ x_2 ∨ x_3 can be satisfied by ttt,
ttf, tft, tff, ftt, ftf and fft, where t means true and f means
false). Let them be

b_i[j,1] b_i[j,2] b_i[j,3], 1 ≤ j ≤ 7,

where b_i[j,k] ∈ {t, f}, 1 ≤ j ≤ 7, 1 ≤ k ≤ 3. For each c_i, we
define 7 substrings r_{i,j}, each of length n (recall that s(m) is
the mth character of string s):


r_{i,j}(m) = 1  if b_i[j,k] = t, m = i_k, k = 1, 2, 3,
r_{i,j}(m) = 0  if b_i[j,k] = f, m = i_k, k = 1, 2, 3,
r_{i,j}(m) = ?  otherwise.

For example, assume n = 6 and c_2 = u_1 ∨ ¬u_4 ∨ ¬u_6, then

r_{2,1} = 1??0?0,
r_{2,2} = 1??0?1,
r_{2,3} = 1??1?0,
r_{2,4} = 1??1?1,
r_{2,5} = 0??0?0,
r_{2,6} = 0??0?1,
r_{2,7} = 0??1?0.
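The seven substrings for a clause can be generated directly from the definition. In the sketch below the clause encoding (+k for u_k, -k for its negation) and the function name `clause_substrings` are assumptions made for illustration; the order in which the strings are produced is not meant to match the indexing r_{i,1}, ..., r_{i,7} used in the text.

```python
from itertools import product


def clause_substrings(n, clause):
    """Return the length-n strings over {0, 1, ?}, one per satisfying
    assignment of the three variables occurring in `clause`."""
    result = []
    for bits in product([True, False], repeat=3):
        # Skip the single assignment that falsifies every literal.
        if not any(b == (lit > 0) for b, lit in zip(bits, clause)):
            continue
        r = ["?"] * n
        for b, lit in zip(bits, clause):
            r[abs(lit) - 1] = "1" if b else "0"
        result.append("".join(r))
    return result


for r in clause_substrings(6, (1, -4, -6)):
    print(r)
```

For a clause on three distinct variables exactly one of the eight assignments is excluded, so seven strings are returned.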

We  define  sj-  = 

It can be checked that a pattern p matches s'_i only when
a_1 ... a_n matches with some r_{i,j}, 1 ≤ j ≤ 7 (and x matches
0#r_{i,1}#...#r_{i,j−1} and y matches r_{i,j+1}#...#0). Thus sample
S = {s_1, s_2, s_3} ∪ (∪_{i=1}^{m} {s'_i}) satisfies our requirement.

Note. The substring #$^{n+1}# of every example can be replaced by
10^{n+1}1 without changing the proof. So, ||Σ|| = 2 is sufficient
for this proof. □

Through an easy extension of the above proof, we can show that
the NP-completeness result holds for the cases k ≥ 3.

Corollary 4.3 For any k ≥ 2, the following problem is
NP-complete: given a finite set S of incomplete positive examples
and an integer t, determine whether there is a k-variable string
pattern p of length ≥ t such that each s ∈ S is an incomplete
example for p.


5  Tree  Patterns 


In this section, we generalize one-dimensional string patterns to
two-dimensional tree patterns. Let F be a countable set disjoint
from Σ and X. Elements in F are called function symbols. A tree
pattern is a non-null rooted directed tree, where its vertices
are labeled and edges are ordered. The internal vertices of a
tree pattern are labeled by function symbols whose direct
subtrees are their arguments. The external vertices (leaves) are
labeled by constants or variables. The set L(t) defined by tree
pattern t is the set of all constant trees obtained by
substituting non-null constant trees for variables occurring in
t. For each tree t, we let |t| be the number of vertices of tree
pattern t. As an example, t in Figure 1 is a two-variable tree
pattern. t_1 and t_2 are two constant trees. t_3 = t[t_1/x_1, t_2/x_2]
is an element in set L(t) by substituting t_1 for x_1 and t_2 for
x_2.

Figure 1: Constant Trees and Tree Patterns

An internal vertex labeled by f is called associative if
f_A(a, f_A(b, c)) = f_A(f_A(a, b), c) and called commutative if
f_C(a, b) = f_C(b, a), where subscripts A and C of f denote the
associativity and commutativity properties. A vertex could be
associative, commutative, both, or neither, according to its
associated function symbol. A subtree with associative vertices
f_A(a, f_A(b, c)) can be flattened to f_A(a, b, c). A constant tree
or tree pattern is normalized if every subtree is flattened,
i.e., two associative vertices with the same function symbol have
no direct parent-child relationship. For example, tree t in
Figure 1 is normalized and tree t_3 is not. Tree t_3 can be
flattened as f(g_A(h_C(0, 1), 1, 0), 1). In the following, trees or
tree patterns are all in the normalized form. String patterns can
be viewed as depth-1 tree patterns with associativity property on
the root.
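Normalization by flattening can be sketched in a few lines. The tree encoding is an assumption made for illustration: a leaf is a string, an internal vertex is a tuple whose first entry is the function symbol followed by its subtrees, and `associative` is the set of function symbols that have the associativity property. The name `flatten` is hypothetical.

```python
def flatten(tree, associative):
    """Return the normalized form: an associative vertex absorbs the
    arguments of any child carrying the same function symbol."""
    if isinstance(tree, str):                  # leaf: constant or variable
        return tree
    label, *kids = tree
    flat = []
    for kid in (flatten(k, associative) for k in kids):
        if label in associative and not isinstance(kid, str) and kid[0] == label:
            flat.extend(kid[1:])               # splice grandchildren in place
        else:
            flat.append(kid)
    return (label, *flat)


print(flatten(("fA", "a", ("fA", "b", "c")), {"fA"}))
```

Children are flattened before their parent, so chains of nested associative vertices collapse in a single pass while the left-to-right order of arguments is preserved.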

5.1  Term  Matching  and  Tree  Patterns 

The term matching (or membership) problem is, given a tree
pattern t and a constant tree s, to decide whether s ∈ L(t).
Restrictions may be put on tree patterns t when considering term
matching. Some standard restrictions are the number of
occurrences of each variable, the number of variables, and the
properties of associativity and commutativity (e.g. AC term
matching).

Term matching is one of the fundamental problems in the area of
term rewriting. The problem has been widely studied with respect
to trees without associative or commutative properties, as well
as one- and two-occurrence AC trees. The non-AC term matching
problem has very efficient algorithms [DKM84] [DKSSC]. The
one-occurrence AC term matching problem has been proved
polynomial-time solvable [BKN85] and the two-occurrence problem
(even with only associativity or commutativity restriction) has
been proved NP-complete [VR89]. Indeed, from our comment about
the relation between string patterns and associative tree
patterns, the NP-completeness of the two-occurrence associative
term matching problem also follows from a generalization of
Angluin's NP-completeness result on the membership problem for
string patterns (Proposition 2.2). We can also consider
k-variable membership problems in string and tree patterns. For
string patterns, the k-variable membership problem for any k ≥ 1
is in NL (nondeterministic log-space), while the complexity of
the k-variable AC term matching problem is still unknown for any
k ≥ 1. The less restrictive versions of the k-variable
associative (or commutative) term matching problem are provably
in NL for any k ≥ 1. This further justifies that tree patterns
with additional restrictions are a natural generalization of
string patterns.

5.2  Positive  Results 

A non-associative, non-commutative tree is a tree such that all
its internal vertices are labeled by non-associative,
non-commutative function symbols. The positions of two subtrees
of a vertex of a non-associative, non-commutative tree cannot be
exchanged because they are ordered. The maximum tree pattern
problem is solvable in polynomial time with respect to this type
of trees. We give, in Figure 2, a procedure MAX-TREE-PATTERN
which takes as input a set of non-associative, non-commutative
constant trees and outputs a maximum common tree pattern.

procedure MAX-TREE-PATTERN(T_1, T_2, ..., T_r)
begin
    t ← empty tree;
    (* Trace all trees T_i, 1 ≤ i ≤ r, in inorder simultaneously *)
    while there are unvisited vertices do
        Visit next unvisited vertex;
        if any two of current vertices are different
        then if current subtree of T_i, 1 ≤ i ≤ r, is
                equal to subtree pointed by p_{T_i,x_j}
             then set the corresponding vertex of t to x_j;
             else let x_l be a new variable and set pointer
                      p_{T_i,x_l}, 1 ≤ i ≤ r, to the corresponding
                      subtree rooted by current vertex of T_i;
                  set the corresponding vertex of t to x_l;
        else set the corresponding vertex of t to
                 the symbol of current vertices;
        if the current vertex of t is set to a variable
        then mark all vertices in current subtrees as visited;
    end;
    return(t);
end.

Figure 2: Finding Common Tree Patterns
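The idea behind Figure 2 can also be written recursively. The sketch below is not the paper's traversal-based procedure; it is a recursive formulation of the same overlay idea, under assumed conventions: a leaf is a string, an internal vertex is a tuple (symbol, child, ...) with at least one child, and fresh variables are named x1, x2, ... (hypothetical names that are assumed not to clash with constants). A table keyed by the tuple of mismatching subtrees plays the role of the pointers p_{T_i,x_j}, so identical mismatches reuse the same variable.

```python
def max_tree_pattern(trees):
    """Overlay all input trees; replace disagreeing subtrees by variables."""
    table = {}                                  # mismatch tuple -> variable name

    def label(t):
        return t if isinstance(t, str) else t[0]

    def children(t):
        return () if isinstance(t, str) else t[1:]

    def build(nodes):
        first = nodes[0]
        same_label = all(label(t) == label(first) for t in nodes)
        same_arity = all(len(children(t)) == len(children(first)) for t in nodes)
        if same_label and same_arity:
            if not children(first):
                return label(first)             # common constant leaf
            kids = [build(tuple(children(t)[i] for t in nodes))
                    for i in range(len(children(first)))]
            return (label(first), *kids)
        if nodes not in table:                  # new mismatch: fresh variable
            table[nodes] = f"x{len(table) + 1}"
        return table[nodes]

    return build(tuple(trees))


t1 = ("f", ("g", "0", "1"), "0")
t2 = ("f", ("g", "0", "1"), "1")
print(max_tree_pattern([t1, t2]))
```

Each vertex of the first tree is examined at most once, and each examination compares one vertex per input tree, so the running time is polynomial in the total input size (up to the cost of hashing mismatching subtrees).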

We overlap all input trees T_1, T_2, ..., T_r and substitute
variables for unmatched corresponding subtrees. If two
substituted subtrees are equal in every tree T_i, 1 ≤ i ≤ r, they
are replaced by the same variable. This procedure will terminate
since unvisited vertices decrease by at least one for every
iteration of the while loop. The tree pattern t we get is the
largest. If it is not, i.e., there exists a tree pattern t' such
that ∪_{i=1}^{r} {T_i} ⊆ L(t') and |t'| > |t|, then at least one
branch b' (path from the root to a leaf) of tree pattern t' is
longer than the corresponding branch b of tree pattern t. If the
end (leaf) of branch b is a variable, then at least two T_i's
corresponding positions are different. However, the branch b' is
longer and so must have a function symbol at this position. Thus
one of the T_i's cannot be contained in L(t'). If the end of branch
b is a constant, then all corresponding branches of the T_i's are
shorter than branch b', so none of the T_i's is in L(t'). Thus, the
size of tree pattern t is maximum for all T_i's. It is not too hard
to verify that the procedure MAX-TREE-PATTERN runs in polynomial
time.

Theorem 5.1 There is a polynomial-time algorithm that finds the
maximum tree pattern for a finite set of non-associative,
non-commutative constant trees.

For any fixed k ≥ 1, we can also find the maximum k-variable tree
pattern by first constructing the maximum tree pattern from the
above procedure and then simply using exhaustive search to reduce
the number of variables to ≤ k.

Theorem 5.2 For any k ≥ 1, there is a polynomial-time algorithm
to find the maximum k-variable tree pattern for a finite set of
non-associative, non-commutative constant trees.

5.3  Negative  Results 

For AC trees, the maximum tree pattern problem is NP-complete,
even if the number of variables is fixed to two. However, the
complexity of the one-variable case is still unknown.

Theorem 5.3 The following problem is NP-complete: given a finite
set S of associative, commutative constant trees and an integer
r, determine whether there is a two-variable tree pattern t of
size ≥ r such that S ⊆ L(t).

Proof: (Sketch) We reduce One-in-Three 3SAT with no negated
literals [GJ79] to this problem. Let U = {u_1, u_2, ..., u_n} be the
set of variables and C = {c_1, c_2, ..., c_m} be a set of clauses
each containing three variables. We say that C is satisfiable if
and only if there exists a truth assignment such that exactly one
variable of each clause is assigned to true. Without loss of
generality, assume each variable u_i, 1 ≤ i ≤ n, or its negation,
occurs at least once in some c_j, 1 ≤ j ≤ m. Let r = 2n + 4. For
each clause c_i = {u_{i_1}, u_{i_2}, u_{i_3}}, 1 ≤ i ≤ m, we generate

s_i = f_AC(t[i_1, i_2, i_3], t[i_2, i_1, i_3], t[i_3, i_1, i_2]),

where t[j, k, l] is a depth-1 tree with root h_AC and 2n − 3
leaves: {g_m(0), g_m(1) : m ∉ {j, k, l}} ∪ {g_j(1), g_k(0), g_l(0)}.
We also define

s_0 = f_AC(t'_1, t'_2, t'_3),

where t'_j, j = 1, 2, 3, is a depth-1 tree with a root h_AC and
2n + 1 leaves: {g_m(0), g_m(1) : 1 ≤ m ≤ n} ∪

{hm- 

Let S = {s_0} ∪ (∪_{i=1}^{m} {s_i}). We want to prove that C is
satisfiable by τ if and only if there is an associative,
commutative tree pattern t of size r = 2n + 4 such that S ⊆ L(t).
For the forward direction, we can see that tree pattern

f_AC(h_AC(g_1(a_1), ..., g_n(a_n), x), y),

is suitable, where x and y are variables and a_i = 1 if
τ(u_i) = true and a_i = 0 otherwise.

For the backward direction, a two-variable tree pattern t whose
L(t) contains S cannot be of the form

f_AC(h_AC(...), h_AC(...), b)

for any variable b or a tree b rooted with h_AC; otherwise t must
contain at least three variables because each subtree h_AC of s_0
has a different function symbol f_j. Therefore, the only suitable
tree patterns of size ≥ 2n + 4 are of the following form:

f_AC(h_AC(g_1(a_1), ..., g_n(a_n), x), y),

where a_i ∈ {0, 1} for 1 ≤ i ≤ n. Since each variable u_i appears at
least once in some clause c_j, example s_j guarantees that only one
subtree of h_AC is rooted by g_i. We define τ(u_i) = true if the
argument a_i of g_i is 1. □

The above proof can be extended to the cases k ≥ 3.

Corollary 5.4 For any k ≥ 2, the following problem is
NP-complete: given a finite set S of associative, commutative
constant trees and an integer r, determine whether there is a
k-variable tree pattern t of size ≥ r such that S ⊆ L(t).

6  Conclusions 

We have investigated pattern learning problems in three new
directions. Recently, we characterized the precise complexity of
the problem FCP to be Σ_2^P-complete. Many problems mentioned in
our paper remain open. We restate them here.

(1) For any fixed k ≥ 2, find longest k-variable patterns for
positive examples.

(2) For any fixed k ≥ 1, find k-variable patterns for positive and
negative examples.

(3) Find longest one-variable patterns for incomplete positive
examples.

(4) Find one-variable AC tree patterns for positive and negative
constant trees.


Abstract

Analog neural networks of limited precision are essentially
k-ary neural networks. That is, their processors classify the
input space into k regions using k − 1 parallel hyperplanes by
computing k-ary weighted multilinear threshold functions. The
ability of k-ary neural networks to learn k-ary weighted
multilinear threshold functions is examined. The well known
perceptron learning algorithm is generalized to a k-ary
perceptron algorithm with guaranteed convergence property.
Littlestone's winnow algorithm is superior to the perceptron
learning algorithm when the ratio of the sum of the weights to
the threshold value of the function being learned is small. A
k-ary winnow algorithm with a mistake bound which depends on this
value and the ratio between the largest and smallest thresholds
is presented.

1  Introduction 

In (Obradovic & Parberry, 1990) it was shown that analog neural
networks of limited precision are essentially k-ary neural
networks (that is, their processors classify R^n into k regions
using k − 1 parallel hyperplanes) and their computing power was
examined. Here, we investigate their learning power. One of the
results from that reference was that there is no canonical set of
threshold values for a k-ary perceptron when k > 3, although they
exist for binary and ternary neural networks. This indicates that
learning algorithms for k-ary neural networks which modify only
the weights are not necessarily convergent. Here we show that
matters can be improved by learning both the thresholds and the
weights. A preliminary version of the results from this paper
appears in (Obradovic & Parberry, 1989).

The main body of this paper is divided into three sections. The first section sketches definitions of the k-ary neural network model and learning in that model. The second section contains a k-ary perceptron learning algorithm (derived from the binary perceptron learning algorithm) and its convergence proof. The third section contains a k-ary winnow algorithm (derived from Littlestone's winnow algorithm (Littlestone, 1987, 1989)) and its mistake bound proof.

2  A  General  Framework  for  Learning 

Let $k \in N$, and $Z_k = \{0, \ldots, k - 1\}$. A k-ary neural architecture is a k-ary neural network with the weights, thresholds and initial activation levels left unspecified. That is, it is a 4-tuple $A = (k, V, I, O)$, where:

$k \in N$ is the number of logic levels,

$V$ is a finite set of processors, or gates,

$I \subseteq V$ is a set of input processors,

$O \subseteq V$ is a set of output processors.

$A(a, w, h)$ denotes the k-ary neural network $(k, V, I, O, a, w, h)$, where

$a : V - I \to Z_k$ is a set of initial activation levels,
$w : V \times V \to R$ is a weight assignment,
$h : V \to R^{k-1}$ is a threshold assignment.

The processors of a k-ary neural network are relatively limited in computing power. Processor $v \in V$ has $k - 1$ thresholds $h_1(v), \ldots, h_{k-1}(v)$, and if its weighted input sum is between $h_i(v)$ and $h_{i+1}(v)$ it has $i$ for output.

More formally, a k-ary function is a function $f : Z_k^n \to Z_k$. Let $F_k^n$ denote the set of all n-input k-ary functions. Define $\Theta_k^n : R^{n+k-1} \to F_k^n$ by $\Theta_k^n(w_1, \ldots, w_n, h_1, \ldots, h_{k-1}) : R^n \to Z_k$, where

$$\Theta_k^n(w_1, \ldots, w_n, h_1, \ldots, h_{k-1})(x_1, \ldots, x_n) = i \quad\text{iff}\quad h_i \le \sum_{j=1}^{n} w_j x_j < h_{i+1}.$$

Here and throughout this paper, we will assume that $h_1 < h_2 < \ldots < h_{k-1}$, and for convenience define $h_0 = -\infty$ and $h_k = \infty$. The set of k-ary weighted multilinear threshold functions is the union, over all $n \in N$, of the range of $\Theta_k^n$. Each processor of a k-ary neural network can compute a k-ary weighted multilinear threshold function of its inputs.
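As an illustration of the definition above, the following Python sketch evaluates a k-ary weighted multilinear threshold function. The helper name `theta` is hypothetical, and the sketch assumes the lower comparison is closed and the upper one open, which is the reading consistent with the worked example in Section 3.

```python
import math
from typing import Callable, Sequence


def theta(weights: Sequence[float], thresholds: Sequence[float]) -> Callable[[Sequence[float]], int]:
    """Build the k-ary threshold function with the given weights and
    strictly increasing thresholds h_1 < ... < h_{k-1}."""
    # Pad with h_0 = -infinity and h_k = +infinity, as in the text.
    h = [-math.inf, *thresholds, math.inf]

    def f(x: Sequence[float]) -> int:
        s = sum(w * xi for w, xi in zip(weights, x))
        # The output is the unique level i with h_i <= s < h_{i+1}.
        return next(i for i in range(len(h) - 1) if h[i] <= s < h[i + 1])

    return f


# A ternary (k = 3) function of two inputs with thresholds 7 and 8.
f = theta([4, 3], [7, 8])
levels = [f(x) for x in [(0, 2), (1, 1), (2, 0)]]
```

With weights (4, 3) the three sample inputs have weighted sums 6, 7 and 8, so they fall in three different regions.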
Let $A = (A_1, A_2, \ldots)$ and $f = (f_1, f_2, \ldots)$ where $A_n = (k, V_n, I_n, O_n)$ is a neural architecture with $\|I_n\| = n$, and $f_n : R^n \to Z_k$. A learning algorithm for $f$ on $A$ is a relativized algorithm $L$ with an oracle for $f$ which on input $n$ outputs a series of distinct initial activation, weight, and threshold assignments $(a_0, w_0, h_0), (a_1, w_1, h_1), \ldots, (a_t, w_t, h_t)$ such that the neural network $M_n = A_n(a_t, w_t, h_t)$ computes $f_n$. We will consider learning algorithms for k-ary weighted multilinear threshold functions on neural circuits (that is, layered neural networks without feedback).

Resources of interest include those of $M_n$, and those of $L$. The former include the size (number of processors), depth (number of layers), and weight (sum of all the weights) of the circuit. The latter include the latency and the mistake bound, defined as follows. The latency of a learning algorithm is the worst case running time between the output of one set of assignments and the next. We will measure unit-cost latency, that is, we will assume that $L$ is implemented on a digital computer with word-size large enough that each elementary arithmetic and logic operation can be implemented in constant time. The mistake bound is the worst case total number of distinct assignments output.

If $f$ is a k-ary weighted multilinear threshold function, we say that $(w_1, \ldots, w_n, t_1, \ldots, t_{k-1}) \in R^{n+k-1}$ is a representation of $f$ iff

$$f = \Theta_k^n(w_1, \ldots, w_n, t_1, \ldots, t_{k-1}).$$

Note that each k-ary weighted multilinear threshold function has many representations. We will consider the problem of learning k-ary weighted multilinear threshold functions, and for the most part be concerned with learning them on the minimal architecture consisting of a single k-ary processor. That is, we will be learning a representation for a k-ary weighted multilinear threshold function $f$, given only an oracle for $f$. All of our learning algorithms will be expressed in a high-level pseudocode. Initial activation levels will always be zero.

We  will  consider  two  new  resources,  called  the  height 
and  width  of  a  representation,  which  give  some  indi¬ 
cation  of  the  relationships  between  the  weights  and 
the  thresholds,  and  between  the  thresholds  themselves 
(respectively),  to  be  defined  later. 

3  A  k-ary  Perceptron  Learning  Rule 


The perceptron learning problem is the problem of learning binary weighted linear functions on a binary neural network consisting of a single processor (called a perceptron for historical reasons). There is a well-known algorithm for the perceptron learning problem which uses the so-called perceptron learning rule to derive successive weights. The algorithm is described in Figure 1.


Theorem  3.1  (The  Perceptron  Convergence  Theo¬ 
rem)  The  perceptron  learning  algorithm  for  learning 
n-input  binary  weighted  linear  threshold  functions  de¬ 
scribed  in  Figure  1  terminates. 


Figure  1:  The  Perceptron  Learning  Algorithm. 


Proof: See, for example, (Duda & Hart, 1973), (Minsky & Papert, 1969), (Nilsson, 1962) or (Novikoff, 1962). □


The initial weights can be set to any value in the for-loop on line 1 in Figure 1. The members of $Z_2^n$ can be used in any order in the for-loop on line 3, provided every member is used an infinite number of times in the algorithm. Also, the value added to $w_i$ in the for-loop of the perceptron.update procedure can be multiplied by some constant $c_j \in R^+$ at the $j$th call of that procedure provided

$$\lim_{m \to \infty} \sum_{j=1}^{m} c_j = \infty$$


and

$$\lim_{m \to \infty} \frac{\sum_{j=1}^{m} c_j^2}{\left(\sum_{j=1}^{m} c_j\right)^2} = 0.$$


Furthermore, the algorithm will learn any weighted linear threshold function whose domain is some finite subset of $R^n$. We will make use of this fact later. The latency of the algorithm is clearly linear in $n$. The worst-case mistake bound appears no better than exponential in $n$.


The minimal architecture for learning n-input k-ary weighted multilinear threshold functions is a single k-ary weighted multilinear threshold gate with n inputs, which we will call a k-ary perceptron. It was shown in (Obradovic & Parberry, 1990) that there is no canonical set of threshold values for a k-ary perceptron when $k > 3$. This suggests that the thresholds must be learned in addition to the weights.

Even if the threshold values are known in advance, many obvious extensions to the perceptron learning rule (such as that shown in Figure 2) which modify only the weights do not necessarily terminate for all
procedure thresholdperceptron(n, ...)
    for i := 1 to n do w_i := 0;
    repeat
        for each x ∈ Z_k^n do
            p := Θ_k^n(w_1, ..., w_n, t_1, ..., t_{k-1})(x)
            if f(x) ≠ p
                then thresholdperceptron.update(x, p);
            Output (w_1, ..., w_n)
    until f(x) = Θ_k^n(w_1, ..., w_n, t_1, ..., t_{k-1})(x)
        for all x ∈ Z_k^n.

procedure thresholdperceptron.update(x, p)
    if f(x) > p then sign := 1
    else sign := -1;
    for i := 1 to n do w_i := w_i + sign * x_i;


Figure 2: A Trial k-ary Perceptron Learning Algorithm for Known Thresholds.

choices of ordering of sample inputs in line 4. For example, suppose $k = 3$ and $n = 2$ (similar examples can be found for arbitrary $n$ and $k$ using the same principles). Consider $f = \Theta_3^2(4, 3, 7, 8)$. Suppose we use the algorithm described in Figure 2 to find weights $w_1, w_2$ such that $\Theta_3^2(w_1, w_2, 7, 8) = f$. After considering points (2, 0) and (2, 2), we have weights $(w_1, w_2) = (4, 2)$. All points are correctly classified using these weights except for the point (1, 1). Thus there is no change to the weights until point (1, 1) is considered, at which time the new weights become (5, 3). Once again, all points are correctly classified using these weights except for the point (1, 1). Thus there is no change to the weights until point (1, 1) is reconsidered, at which time the new weights are again (4, 2). Thus the weights cycle between (4, 2) and (5, 3) without ever reaching an acceptable solution. Matters are not improved by making obvious changes to Figure 2, for example, instead of adding a multiple of $x_i$ in the for loop of the thresholdperceptron.update procedure, substituting one if $x_i > 0$ and zero otherwise.
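The cycling behaviour described above can be replayed with a short simulation. This is a sketch only: the function names are hypothetical, the inputs are taken from {0, 1, 2}^2, the thresholds 7 and 8 are held fixed, and the presentation order (first (2, 0) and (2, 2), then the remaining points in lexicographic order) is one assumed ordering that reproduces the cycle.

```python
from itertools import product


def theta(w, t):
    """k-ary threshold function: the number of thresholds t_i with t_i <= w.x,
    which equals the level i with t_i <= w.x < t_{i+1} for increasing t."""
    return lambda x: sum(ti <= sum(wi * xi for wi, xi in zip(w, x)) for ti in t)


def run_fixed_threshold_rule(f, t, order, passes):
    """Weight-only update rule with known thresholds t; records the weights
    and the number of mistakes at the end of every pass over `order`."""
    w = [0] * len(order[0])
    history = []
    for _ in range(passes):
        mistakes = 0
        for x in order:
            p = theta(w, t)(x)
            if f(x) != p:
                sign = 1 if f(x) > p else -1
                w = [wi + sign * xi for wi, xi in zip(w, x)]
                mistakes += 1
        history.append((tuple(w), mistakes))
    return history


target = theta([4, 3], [7, 8])
rest = [x for x in product(range(3), repeat=2) if x not in {(2, 0), (2, 2)}]
order = [(2, 0), (2, 2)] + rest
history = run_fixed_threshold_rule(target, [7, 8], order, passes=6)
```

Every pass ends with at least one mistake, and the weights at the end of successive passes alternate between (5, 3) and (4, 2), even though the weights (4, 3) would classify every point correctly.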

However,  matters  can  be  improved  by  learning  both 
the  thresholds  and  the  weights. 

It can be shown from first principles that the k-ary perceptron learning algorithm described in Figure 3 terminates. Termination can more easily be proved as a corollary of the Perceptron Convergence Theorem as follows.

Definition. If $f$ is an n-input k-ary weighted multilinear threshold function, the orthogonal slice function for $f$ is a binary weighted linear threshold function $g$ such that for all $x \in Z_k^n$ and all $i \in Z_k$,

$$f(x) \ge i \iff g(x, y(i)) = 1,$$

where $y : \{1, \ldots, k - 1\} \to \{0, 1\}^{k-1}$ is defined by $y(i) = (y_1, \ldots, y_{k-1})$ with $y_j = 1$ iff $j = i$.


procedure multiperceptron(n, k)
    for i := 1 to n do w_i := 0;
    for j := 1 to k - 1 do t_j := 0;
    repeat
        for each x = (x_1, ..., x_n) ∈ Z_k^n do
            p := Θ_k^n(w_1, ..., w_n, t_1, ..., t_{k-1})(x);
            if f(x) ≠ p
                then multiperceptron.update(x, p);
            Output (w_1, ..., w_n, t_1, ..., t_{k-1})
    until f(x) = Θ_k^n(w_1, ..., w_n, t_1, ..., t_{k-1})(x)
        for all x ∈ Z_k^n.

procedure multiperceptron.update(x, p)
    if f(x) > p
        then t_{p+1} := t_{p+1} - 1; sign := 1
        else t_p := t_p + 1; sign := -1;
    for i := 1 to n do w_i := w_i + sign * x_i;


Figure 3: The k-ary Perceptron Learning Algorithm.

Lemma 3.2 $f = \Theta_k^n(w_1, \ldots, w_n, t_1, \ldots, t_{k-1})$ iff $\Theta_2^{n+k-1}(w_1, \ldots, w_n, -t_1, \ldots, -t_{k-1}, 0)$ is the orthogonal slice function for $f$.

Proof: Follows immediately from the definition of the orthogonal slice function. □
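Lemma 3.2 can be checked exhaustively on a small instance. The sketch below uses hypothetical helper names and the example function with weights (4, 3) and thresholds (7, 8); it assumes that y(i) is the one-hot vector of length k - 1 and that a binary gate outputs 1 when its weighted sum reaches its threshold.

```python
from itertools import product


def theta(w, t):
    """Level of x: the number of thresholds t_i with t_i <= w.x."""
    return lambda x: sum(ti <= sum(wi * xi for wi, xi in zip(w, x)) for ti in t)


def one_hot(i, k):
    """y(i): length k-1 vector with a single 1 in position i (1-based)."""
    return tuple(int(j == i) for j in range(1, k))


w, t, k = [4, 3], [7, 8], 3
f = theta(w, t)
# Binary gate on n + k - 1 inputs: weights (w, -t_1, ..., -t_{k-1}), threshold 0.
g = theta(w + [-ti for ti in t], [0])

slice_matches = all(
    (f(x) >= i) == (g(x + one_hot(i, k)) == 1)
    for x in product(range(k), repeat=len(w))
    for i in range(1, k)
)
```

Appending y(i) to the input subtracts exactly t_i from the weighted sum, so the binary gate fires precisely when the sum reaches the i-th threshold.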

Theorem 3.3 (The k-ary Perceptron Convergence Theorem) The k-ary perceptron learning algorithm for learning n-input k-ary weighted multilinear threshold functions described in Figure 3 terminates.

Proof: (Sketch) The learning algorithm, instead of learning $f$, learns the orthogonal slice function for $f$ using the binary perceptron learning algorithm shown in Figure 1. The orthogonal slice function for $f$ is guaranteed to exist by (the "only-if" part of) Lemma 3.2. Once the orthogonal slice function has been learned, $f$ can be reconstructed using (the "if" part of) Lemma 3.2. The algorithm is guaranteed to terminate by the Perceptron Convergence Theorem. It is clear that the algorithm realizes Figure 3. □

The latency of the k-ary perceptron learning algorithm is $O(n)$, and the mistake bound is no worse than for the binary perceptron learning algorithm on $n + k$ inputs.
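The procedure of Figure 3 translates into a few lines of Python. This is a sketch under stated assumptions rather than the authors' implementation: the names are hypothetical, a `max_passes` guard is added, and since the thresholds need not stay sorted while learning, the prediction p is taken to be the smallest level whose upper threshold exceeds the weighted sum, which coincides with the usual definition when the thresholds are increasing.

```python
from itertools import product


def predict(w, t, x):
    """Smallest p with w.x < t_{p+1}, taking t_k = +infinity."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return next((p for p, upper in enumerate(t) if s < upper), len(t))


def multiperceptron(f, n, k, max_passes=10_000):
    """Learn weights and thresholds of a k-ary threshold function f."""
    w, t = [0] * n, [0] * (k - 1)
    for _ in range(max_passes):
        clean = True
        for x in product(range(k), repeat=n):
            p = predict(w, t, x)
            if f(x) == p:
                continue
            clean = False
            if f(x) > p:
                t[p] -= 1       # t_{p+1} := t_{p+1} - 1 (list is 0-based)
                sign = 1
            else:
                t[p - 1] += 1   # t_p := t_p + 1
                sign = -1
            w = [wi + sign * xi for wi, xi in zip(w, x)]
        if clean:
            return w, t
    raise RuntimeError("no mistake-free pass within max_passes")


def target(x):
    """The example function with weights (4, 3) and thresholds (7, 8)."""
    s = 4 * x[0] + 3 * x[1]
    return (s >= 7) + (s >= 8)


w, t = multiperceptron(target, n=2, k=3)
learned_ok = all(predict(w, t, x) == target(x) for x in product(range(3), repeat=2))
```

On the example that defeated the fixed-threshold rule, moving the thresholds together with the weights reaches a mistake-free pass.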

A second candidate architecture for learning k-ary weighted multilinear threshold functions consists of $\lceil \log k \rceil$ binary perceptrons, the $i$th of which learns the $i$th bit of the output value, together with a single k-ary weighted multilinear threshold gate with exponentially increasing weights which converts the binary output of these gates into the corresponding member of $Z_k$. Unfortunately this cannot work because the last binary perceptron is expected to learn the least significant bit of the k-ary output, which is not necessarily
procedure netperceptron(n, k)
    for i := 1 to k - 1 do
        for j := 1 to n do w_{i,j} := 0;
    repeat
        for each x = (x_1, ..., x_n) ∈ Z_k^n do
            for each i ∈ Z_k do neuron.update(x, i)
        Output (w_{1,1}, ..., w_{k-1,n})
    until (f(x) ≥ i) ⟺ (Θ_2^n(w_{i,1}, ..., w_{i,n}, 0)(x) = 1)
        for all x ∈ Z_k^n and 1 ≤ i ≤ k - 1.

procedure neuron.update(x, i)
    p := Θ_2^n(w_{i,1}, ..., w_{i,n}, 0)(x);
    if (f(x) ≥ i) and (p = 0) then sign := 1
    else if (f(x) < i) and (p = 1)
        then sign := -1
        else sign := 0;
    for j := 1 to n do w_{i,j} := w_{i,j} + sign * x_j;


Figure 4: The k-ary Perceptron Learning Algorithm for a Depth 2 Circuit.


a  binary  weighted  threshold  function. 

A third candidate architecture for learning k-ary weighted multilinear threshold functions consists of a depth 2 circuit of size k. The first layer consists of $k - 1$ binary perceptrons, each connected to all of the inputs. The second layer consists of a single k-ary perceptron, connected to all of the gates in the first layer. The thresholds of the first layer are all zero. The thresholds of the k-ary perceptron are $1, 2, \ldots, k - 1$. The weights of the connections from the first layer to the second are all one. The weights of the connections from the inputs to the first layer will be learned.

Let $w_{i,j}$ denote the weight from the $j$th input to the $i$th gate on the first layer, where $1 \le i \le k - 1$ and $1 \le j \le n$. Suppose we are to learn a k-ary weighted multilinear threshold function $f$. The first level essentially computes the orthogonal slice function for $f$, and the second level converts this to a value from $Z_k$. More precisely, the $i$th gate on the first level, $1 \le i \le k - 1$, will output one on input $x$ iff $f(x) \ge i$. This implies that exactly $f(x)$ of the gates in the first layer will be active. The output gate sums the number of active gates in the first layer.

It is clear that by performing the binary perceptron learning rule in parallel for all $k - 1$ gates in the first layer, the network will learn arbitrary k-ary weighted multilinear threshold functions. The learning algorithm is described in Figure 4. It has a latency of $O(nk)$. Its mistake bound may be better than that of Figure 3 in practice since it learns arbitrary separating hyperplanes, rather than parallel ones. However, the worst case mistake bound remains apparently exponential.


4  A  k-ary  Winnow  Algorithm 

A representation $(w_1, \ldots, w_n, t_1, \ldots, t_{k-1}) \in R^{n+k-1}$ of a k-ary weighted multilinear threshold function is positive iff $w_i \ge 0$ for all $1 \le i \le n$. A k-ary weighted multilinear threshold function is positive iff it has a positive representation.

A positive representation $(w_1, \ldots, w_n, t_1, \ldots, t_{k-1})$ has separation $\lambda \in R^+$, $0 < \lambda \le 1$, if for all $x = (x_1, \ldots, x_n) \in Z_k^n$, and all $i \in Z_k$, $i < k - 1$,

$$f(x) \le i \quad\text{iff}\quad \sum_{j=1}^{n} w_j x_j \le (1 - \lambda)\, t_{i+1}.$$

The width of a representation

$(w_1, \ldots, w_n, t_1, \ldots, t_{k-1})$

is the ratio $t_{k-1}/t_1$. Note that all representations of a binary weighted linear threshold function have width one. A k-ary weighted multilinear threshold function has width (at most) $d$ iff it has a representation of width $d$.

The height of a representation

$(w_1, \ldots, w_n, t_1, \ldots, t_{k-1})$

of  width  d  is  the  ratio 


A k-ary weighted multilinear threshold function has height (at most) $h$ iff it has a representation of height $h$. A k-ary weighted multilinear threshold function is $(\lambda, h, d)$-separable if it has a positive representation of separation $\lambda > 0$, height $h$, and width $d$. Since all binary weighted linear threshold functions have width 1, we will write $(\lambda, h)$-separable when $k = 2$.

When  learning  weighted  linear  threshold  functions, 
we  can  without  loss  of  generality  restrict  ourselves  to 
learning  positive  ones.  If  we  need  to  learn  a  weighted 
linear  threshold  function  with  negative  weights,  we  can 
substitute  a  positive  function  of  the  same  height,  and 
perform  a  minimal  amount  of  pre-processing  of  the 
inputs: 

Lemma 4.1 For each representation $(v_1, \ldots, v_n, t)$ of height $h$ there exists a positive representation $(w_1, \ldots, w_n, r)$ of height $h$ and a function $g : Z_2^n \to Z_2^n$ of the form $g(x_1, \ldots, x_n) = (y_1, \ldots, y_n)$ where for $1 \le i \le n$, either $y_i = x_i$ or $y_i = \bar{x}_i$, such that for all $x \in Z_2^n$,

$$f(x) = \Theta_2^n(w_1, \ldots, w_n, r)(g(x)).$$

Proof: We make use of an elementary technique due to (Muroga, 1971) (see also Theorem 4.5.2 of (Parberry, 1990)). Suppose

$$f(x) = \Theta_2^n(v_1, \ldots, v_n, t).$$

Then

$$w_i = |v_i| \quad\text{for } 1 \le i \le n,$$

and

$$r = t + \sum_{i=1}^{n} (|v_i| - v_i)/2,$$

and $y_i = 0.5 + (x_i - 0.5)\, v_i/|v_i|$. The new representation has height at most $h$ since its denominator is larger than that of the original representation, whilst its numerator is the same. □
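The transformation in the proof of Lemma 4.1 is easy to verify by brute force. In the sketch below the helper names and the sample weights are hypothetical, a binary gate is assumed to output 1 when its weighted sum reaches the threshold, and inputs with zero weight are left unchanged (the formula for $y_i$ is undefined there).

```python
from itertools import product


def gate(w, t):
    """Binary threshold gate: outputs 1 iff w.x >= t."""
    return lambda x: int(sum(wi * xi for wi, xi in zip(w, x)) >= t)


def make_positive(v, t):
    """Return (w, r, g) with w_i = |v_i|, r = t + sum(|v_i| - v_i)/2, and g
    complementing exactly the inputs whose original weight is negative."""
    w = [abs(vi) for vi in v]
    r = t + sum(abs(vi) - vi for vi in v) / 2
    g = lambda x: tuple(1 - xi if vi < 0 else xi for vi, xi in zip(v, x))
    return w, r, g


v, t = [3, -2, 0, -5], 1          # illustrative weights, two of them negative
w, r, g = make_positive(v, t)
agree = all(
    gate(v, t)(x) == gate(w, r)(g(x))
    for x in product((0, 1), repeat=len(v))
)
```

Here the threshold moves from 1 to 8 because the two negative weights contribute 2 + 5 to the sum in the definition of r.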

The  threshold  value  in  a  positive  representation  is 
not  important. 

Lemma 4.2 For each positive representation $(v_1, \ldots, v_n, t)$ of height $h$ and separation $\lambda$ with $t > 0$, and all $r \in R^+$, there is a positive representation $(w_1, \ldots, w_n, r)$ of height $h$ and separation $\lambda$ such that

$$\Theta_2^n(v_1, \ldots, v_n, t) = \Theta_2^n(w_1, \ldots, w_n, r).$$

Proof: Set $w_i = v_i r/t$ for $1 \le i \le n$. □

Littlestone (1987, 1989) proposed a learning algorithm, called the winnow algorithm (see Figure 5) for learning n-input binary, positive, $(\lambda, h)$-separable functions on a single n-input perceptron. The algorithm takes as parameter a constant $\alpha$, and learns a positive representation with threshold value $n$ (such a representation exists, by Lemma 4.2). The latency of the binary winnow algorithm is clearly linear in $n$. The mistake bound is given by the following theorem.

Theorem 4.3 If $\alpha = 0.5\lambda + 1$, then the number of mistakes made by the winnow algorithm in Figure 5 learning an n-input, positive, $(\lambda, h)$-separable weighted linear threshold function is at most

$$\left(\frac{14 \log n}{\lambda^2} + \frac{5}{\lambda}\right) h + \frac{8}{\lambda^2}.$$

Proof: See (Littlestone, 1987). □

Later we will use the fact (Littlestone, 1989) that the winnow algorithm will learn any n-input, positive, $(\lambda, h)$-separable weighted linear threshold function whose domain is some finite subset of $(\{0\} \cup [\delta, 1])^n$. In that case, the mistake bound from Theorem 4.3 depends on $\log(n/\delta)$ instead of $\log n$.

The mistake bound is a significant improvement over the binary perceptron learning algorithm for weighted linear threshold functions with large separation and
small  height.  In  contrast,  the  best  known  mistake  up¬ 
per  bound  for  the  perceptron  learning  algorithm  (see, 
for  example,  Duda  &  Hart,  1973)  is  polynomial  in  the 
weight  of  the  best  representation  (if  it  is  sufficiently 
large).  It  is  known  (see,  for  example,  Muroga,  1971; 
Parberry,  1990)  that  there  are  weighted  linear  thresh¬ 
old  functions  for  which  the  weight  of  the  best  rep¬ 
resentation  is  at  least  exponential  in  n,  and  it  can 
be  deduced  that  there  are  functions  with  exponential 
weight,  polynomial  height  and  inverse-polynomial  sep¬ 
aration.  Whilst  the  perceptron  learning  algorithm  ap¬ 
pears  to  make  exponentially  many  mistakes  for  these 


procedure winnow(n, α)
    for i := 1 to n do w_i := 1;
    repeat
        for each x = (x_1, ..., x_n) ∈ Z_2^n do
            p := Θ_2^n(w_1, ..., w_n, n)(x);
            if f(x) ≠ p
                then winnow.update(x, p, α);
            Output (w_1, ..., w_n, n)
    until f(x) = Θ_2^n(w_1, ..., w_n, n)(x)
        for all x ∈ Z_2^n.

procedure winnow.update(x, p, α)
    if f(x) > p then δ := α
    else δ := 1/α;
    for i := 1 to n do w_i := w_i δ^{x_i}


Figure 5: The Winnow Learning Algorithm.


functions,  the  winnow  learning  algorithm  makes  only 
polynomially  many  mistakes.  If  A  and  h  are  constant, 
only  0(log  n)  mistakes  are  made. 
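The multiplicative update of Figure 5 can be sketched in Python as follows. The function name, the `max_passes` guard and the target function (a disjunction of two out of five inputs) are illustrative assumptions; the gate is assumed to output 1 when its weighted sum reaches the fixed threshold n, and $\alpha = 1.5$ corresponds to $\alpha = 0.5\lambda + 1$ with $\lambda = 1$.

```python
from itertools import product


def winnow(f, n, alpha, max_passes=1000):
    """Binary winnow with fixed threshold n; returns (weights, mistakes)."""
    w = [1.0] * n
    mistakes = 0
    for _ in range(max_passes):
        clean = True
        for x in product((0, 1), repeat=n):
            p = int(sum(wi * xi for wi, xi in zip(w, x)) >= n)
            if f(x) != p:
                clean = False
                mistakes += 1
                # Promote on a missed 1, demote on a false 1; only the
                # weights of inputs that are switched on change.
                delta = alpha if f(x) > p else 1 / alpha
                w = [wi * delta ** xi for wi, xi in zip(w, x)]
        if clean:
            return w, mistakes
    raise RuntimeError("no mistake-free pass within max_passes")


def disjunction(x):
    """Illustrative target: x_1 OR x_3 over five binary inputs."""
    return int(x[0] or x[2])


n = 5
w, mistakes = winnow(disjunction, n, alpha=1.5)
consistent = all(
    int(sum(wi * xi for wi, xi in zip(w, x)) >= n) == disjunction(x)
    for x in product((0, 1), repeat=n)
)
```

Because a demotion only happens on inputs where no relevant variable is switched on, the weights of the two relevant inputs never shrink, which is what keeps the number of promotions logarithmic in n for this target.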

In the light of Lemma 4.1, the definition of $(\lambda, h)$-separability can be extended to non-positive weighted linear threshold functions as follows. A representation $(w_1, \ldots, w_n, t)$ has separation $\lambda \in R^+$, $0 < \lambda \le 1$, if for all $x = (x_1, \ldots, x_n) \in Z_2^n$,

$$f(x) \le 0 \quad\text{iff}\quad \sum_{j=1}^{n} |w_j|\, x_j \le (1 - \lambda)\left(t + \sum_{j=1}^{n} (|w_j| - w_j)/2\right).$$

Hence  we  have: 


Corollary 4.4 Any n-input $(\lambda, h)$-separable weighted linear threshold function can be learned on a neural circuit of depth 2 and size at most $n$ with latency $O(n)$ and mistake bound

$$\left(\frac{14 \log n}{\lambda^2} + \frac{5}{\lambda}\right) h + \frac{8}{\lambda^2}.$$

Proof: The result is an immediate consequence of Lemma 4.1 and Theorem 4.3. □


We extended the winnow algorithm to k-ary weighted multilinear threshold functions. The k-ary winnow algorithm for learning n-input, k-ary, positive, $(\lambda, h, d)$-separable functions on a single k-ary perceptron with n inputs is described in Figure 6.

The latency of the algorithm is $O(n + k)$. To prove the mistake bound we will use a new slice function which preserves positiveness.

Definition. If $f$ is an n-input k-ary weighted multilinear threshold function, the unary slice function for $f$ is a binary weighted linear threshold function $g$ such that for all $x \in Z_k^n$ and all $i \in Z_k$,

$$f(x) \ge i \iff g(x, y(i)) = 1,$$
procedure multiwinnow.learning(n, k, α)
    for i := 1 to n + k - 2 do w_i := 1;
    t_{k-1} := (k - 1)(n + k - 2);
    for i := k - 2 downto 1 do t_i := t_{i+1} - 1;
    repeat
        for each x = (x_1, ..., x_n) ∈ Z_k^n do
            p := Θ_k^n(w_1, ..., w_n, t_1, t_2, ..., t_{k-1})(x);
            if f(x) ≠ p then
                multiwinnow.update(x, p, α);
            Output (w_1, ..., w_n, t_1, t_2, ..., t_{k-1})
    until f(x) = Θ_k^n(w_1, ..., w_n, t_1, ..., t_{k-1})(x)
        for all x ∈ Z_k^n

procedure multiwinnow.update(x, p, α)
    for i := 1 to n do z_i := x_i;
    for i := 1 to k - 2 do z_{n+i} := 0;
    if f(x) > p then δ := α; ind := p + 1
    else δ := 1/α; ind := p;
    for i := ind to k - 2 do z_{n+i} := 1;
    for i := 1 to n + k - 2 do w_i := w_i δ^{z_i};
    for i := k - 2 downto 1 do t_i := t_{i+1} - w_{n+i}


Figure 6: The k-ary Winnow Learning Algorithm.


where $y : \{1, \ldots, k - 1\} \to \{0, 1\}^{k-2}$ is defined by $y(i) = (y_1, \ldots, y_{k-2})$ with $y_j = 1$ iff $j \ge i$.


Lemma 4.5 If $(v_1, \ldots, v_n, t_1, \ldots, t_{k-1})$ is a positive representation of height $h$, width $d$, and separation $\lambda$ of a k-ary weighted multilinear threshold function $f$, then $(v_1, \ldots, v_n, t_2 - t_1, t_3 - t_2, \ldots, t_{k-1} - t_{k-2}, t_{k-1})$ is a positive representation of height $h$ and separation $\lambda/d$ of the unary slice function for $f$.

Proof: Follows immediately from the definition of the unary slice function. □
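The construction in Lemma 4.5 can be checked on a small case. The sketch below uses hypothetical names and an illustrative 4-ary function with weights (4, 3) and thresholds (7, 8, 12); it assumes y(i) has length k - 2 with $y_j = 1$ iff $j \ge i$, and it only verifies that the slice representation computes the unary slice function with non-negative weights, not the height or separation claims.

```python
from itertools import product


def level(w, t, x):
    """Level of x: the number of thresholds t_i with t_i <= w.x."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return sum(ti <= s for ti in t)


def unary(i, k):
    """y(i) of length k-2 with y_j = 1 iff j >= i, for j = 1..k-2."""
    return tuple(int(j >= i) for j in range(1, k - 1))


w, t, k = [4, 3], [7, 8, 12], 4
# Slice weights: the original weights followed by the gaps t_{j+1} - t_j;
# the slice threshold is the largest original threshold t_{k-1}.
slice_w = w + [t[j + 1] - t[j] for j in range(k - 2)]
slice_t = t[-1]

slice_ok = all(
    (level(w, t, x) >= i)
    == (sum(a * b for a, b in zip(slice_w, x + unary(i, k))) >= slice_t)
    for x in product(range(k), repeat=len(w))
    for i in range(1, k)
)
```

Switching on the gaps from position i upward adds $t_{k-1} - t_i$ to the weighted sum, so comparing against $t_{k-1}$ is the same as comparing the original sum against $t_i$, and all slice weights stay non-negative.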


Theorem 4.6 If $\alpha = 1 + \lambda/(2d)$, then the number of mistakes made by the k-ary winnow algorithm in Figure 6 learning an n-input, positive, $(\lambda, h, d)$-separable k-ary weighted multilinear threshold function is bounded above by

$$\left(\frac{14 d^2 \log((k-1)(n+k-2))}{\lambda^2} + \frac{5d}{\lambda}\right)(k-1)h + \frac{8d^2}{\lambda^2}.$$

Proof: (Sketch) By Lemma 4.5 the learning algorithm, instead of learning a positive representation

$(w_1, \ldots, w_n, t_1, t_2, \ldots, t_{k-1})$

of height $h$ and separation $\lambda$ for a k-ary weighted multilinear threshold function $f$, can learn a positive representation

$(w_1, \ldots, w_n, t_2 - t_1, t_3 - t_2, \ldots, t_{k-1} - t_{k-2}, t_{k-1})$

of height $h$ and separation $\lambda^* = \lambda/d$ of the unary slice function for $f$. Since inputs $(x_1, \ldots, x_n)$ for function $f$ are from $Z_k^n$, we cannot apply the binary winnow learning algorithm directly to learn the unary slice function for $f$ on inputs $(x, y(i)) = (x_1, \ldots, x_n, y_1, \ldots, y_{k-2})$. But we can use the binary winnow learning algorithm with learning parameter $\alpha$ and threshold $t = n + k - 2$ to learn a modified unary slice function

$(w_1, \ldots, w_n, t_2 - t_1, t_3 - t_2, \ldots, t_{k-1} - t_{k-2}, t_{k-1}/(k-1))$

on compressed inputs

$(x_1/(k-1), \ldots, x_n/(k-1), y_1/(k-1), \ldots, y_{k-2}/(k-1))$

from $(\{0\} \cup [1/(k-1), 1])^{n+k-2}$.

Finally, observe that [...] and also

$$\sum_{i=1}^{n} \frac{w_i x_i}{k-1} + \sum_{i=1}^{k-2} \frac{w_{n+i}\, y_i}{k-1} < (n + k - 2)$$

iff

$$\sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{k-2} w_{n+i}\, y_i < (k-1)(n + k - 2).$$

So, for learning we can actually use the binary winnow algorithm with learning parameter [...] and threshold $t = (k-1)(n + k - 2)$ on the original inputs $(x_1, \ldots, x_n, y_1, \ldots, y_{k-2})$. It is easy to see that this algorithm realizes Figure 6. Substituting $\alpha = 1 + \lambda^*/2$, $t = n + k - 2$ and $\delta = 1/(k-1)$ in Theorem 4.3 we obtain the mistake bound from the claim of this theorem. □

The mistake bound of the k-ary winnow algorithm is a significant improvement over the k-ary perceptron learning algorithm for $(\lambda, h, d)$-separable functions when $\lambda$ is large and $h$ and $d$ are small.

A slightly better mistake bound can be obtained if the input $x \in Z_k^n$ is encoded in binary as $x^* \in Z_2^{n \log k}$ and the k-ary winnow algorithm is used on a k-ary perceptron with $n \log k$ binary inputs instead of $n$ k-ary inputs. Instead of learning k-ary weighted multilinear threshold functions, we can substitute binary-to-k-ary weighted multilinear threshold functions, which are simply k-ary weighted multilinear threshold functions whose domain is restricted to $Z_2^n$, and perform a small amount of pre-processing of the inputs. Without loss of generality, we will henceforth assume that $k$ is a power of two. Define function $Encode : Z_k^n \to Z_2^{n \log k}$, which simply encodes a k-ary input in binary, by $Encode(x) = (y_1, \ldots, y_{n \log k})$ where $x = (x_1, \ldots, x_n) \in Z_k^n$ and

$$x_i = \sum_{j=1}^{\log k} 2^{j-1}\, y_{j + (i-1)\log k}.$$
Lemma 4.7 For every n-input, positive, k-ary weighted multilinear threshold function $f$ of height $h$ and width $d$ there exists an $(n \log k)$-input binary-to-k-ary weighted multilinear threshold function $g$ of height $(k - 1)h$ and width $d$ such that for all $x \in Z_k^n$, $f(x) = g(Encode(x))$.

Proof: If $f = \Theta_k^n(w_1, \ldots, w_n, t_1, \ldots, t_{k-1})$ then $g$ is the function $\Theta_k^{n \log k}(w_{1,1}, \ldots, w_{n,\log k}, t_1, \ldots, t_{k-1})$ with domain restricted to $Z_2^{n \log k}$, where $w_{i,j} = 2^{j-1} w_i$ for $1 \le i \le n$, $1 \le j \le \log k$. □
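The encoding step behind Lemma 4.7 can be sketched as follows. The function names are hypothetical, and the bit order (least significant bit first within each k-ary digit) is an assumption, since any fixed order works as long as the weights are expanded to match.

```python
from itertools import product


def encode(x, k):
    """Encode each k-ary digit in log2(k) bits, least significant bit first.
    Assumes k is a power of two."""
    bits = k.bit_length() - 1
    return tuple((xi >> j) & 1 for xi in x for j in range(bits))


def expand_weights(w, k):
    """Weight 2^j * w_i on bit j of input i, so weighted sums are preserved."""
    bits = k.bit_length() - 1
    return [wi * 2 ** j for wi in w for j in range(bits)]


w, k = [4, 3], 4
wide = expand_weights(w, k)
sums_preserved = all(
    sum(a * b for a, b in zip(w, x)) == sum(a * b for a, b in zip(wide, encode(x, k)))
    for x in product(range(k), repeat=len(w))
)
```

Since the weighted sum is unchanged, the same thresholds represent the encoded function, while the sum of the weights grows by the factor $2^0 + \cdots + 2^{\log k - 1} = k - 1$, matching the height $(k - 1)h$ in the lemma.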


Lemma 4.8 If $(w_1, \ldots, w_{n+k-2}, t)$ is the representation of the unary slice function of a binary-to-k-ary weighted multilinear threshold function $f$, then $(w_1, \ldots, w_n, t_1, \ldots, t_{k-1})$ is a representation of $f$, where

$$t_i = t - \sum_{j=n+i}^{n+k-2} w_j.$$

Proof: Follows immediately from the definition of the unary slice function. □

Theorem 4.9 Any n-input, positive, $(\lambda, h, d)$-separable binary-to-k-ary weighted multilinear threshold function can be learned on a k-ary perceptron with latency $O(n + k)$ and mistake bound

$$\left(\frac{14 d^2 \log(n + k - 2)}{\lambda^2} + \frac{5d}{\lambda}\right) h + \frac{8d^2}{\lambda^2}.$$

Proof: (Sketch) The domain of a binary-to-k-ary weighted multilinear threshold function is restricted to $Z_2^n$. So, the learning algorithm, instead of learning $f$, can learn the unary slice function for $f$ using the binary winnow learning algorithm. The threshold is chosen equal to $n + k - 2$ by Lemma 4.2. The unary slice function for $f$ is guaranteed to exist by Lemma 4.5. Once the unary slice function has been learned, $f$ can be reconstructed using Lemma 4.8. The mistake bound is given by Theorem 4.3. □

Theorem 4.10 Any n-input, positive, $(\lambda, h, d)$-separable k-ary weighted multilinear threshold function can be learned on a k-ary neural circuit of depth 4 and size $O(nk)$ with latency $O(n \log k + k)$ and mistake bound

$$\left(\frac{14 d^2 \log(n \log k + k - 2)}{\lambda^2} + \frac{5d}{\lambda}\right)(k-1)h + \frac{8d^2}{\lambda^2}.$$

Proof: (Sketch) The result is an immediate consequence of Lemma 4.7 and Theorem 4.9. A depth 3, size $O(nk)$ k-ary threshold circuit can compute function Encode. The details are left to the interested reader. □

The definition of $(\lambda, h, d)$-separability can be extended to non-positive weighted multilinear threshold functions as follows. A k-ary weighted multilinear threshold function $f : Z_k^n \to Z_k$ is $(\lambda, h, d)$-separable iff its binary encoded equivalent $f^* : Z_2^{n \log k} \to Z_k$ is $(\lambda, (k-1)h, d)$-separable, where separation is $\lambda \in R^+$, $0 < \lambda \le 1$, if for all $x \in Z_2^{n \log k}$ and all $i \in Z_k$, $i < k - 1$,

$$f(x) \le i \quad\text{iff}\quad \sum_{j=1}^{n \log k} |w_j|\, x_j \le (1 - \lambda)\left(t_{i+1} + \sum_{j=1}^{n \log k} (|w_j| - w_j)/2\right).$$

Then we have the extension of the result to the non-positive case:

Corollary 4.11 Any n-input $(\lambda, h, d)$-separable k-ary weighted multilinear threshold function can be learned on a k-ary neural circuit of depth 4 and size $O(nk)$ with latency $O(n \log k + k)$ and mistake bound

$$\left(\frac{14 d^2 \log(n \log k + k - 2)}{\lambda^2} + \frac{5d}{\lambda}\right)(k-1)h + \frac{8d^2}{\lambda^2}.$$

Proof: (Sketch) Suppose $f$ is a $(\lambda, h, d)$-separable k-ary weighted multilinear threshold function. Then it has an $(n \log k)$-input $(\lambda, (k-1)h, d)$-separable binary-to-k-ary weighted multilinear threshold function $f_1$ by Lemma 4.7. By Lemma 4.5, $f_1$ has an $(n \log k + k - 2)$-input $(\lambda/d, (k-1)h)$-separable unary slice function $f_2$, which can be replaced by an $(n \log k + k - 2)$-input positive $(\lambda/d, (k-1)h)$-separable unary slice function $f_3$ by Lemma 4.1. A depth 2, size $O(nk)$ k-ary threshold circuit can compute function Encode and the negations of the appropriate inputs. The winnow algorithm is used to learn a representation for $f_3$. By Theorem 4.3, the latency is $O(n \log k + k)$ and the mistake bound is

$$\left(\frac{14 d^2 \log(n \log k + k - 2)}{\lambda^2} + \frac{5d}{\lambda}\right)(k-1)h + \frac{8d^2}{\lambda^2}.$$

A careful analysis gives the required mistake bound. □

The k-ary winnow algorithm can also be used to
learn k-ary weighted multilinear threshold functions
on a network of size k and depth 2 by using essentially
the same techniques as were used for the perceptron
learning algorithm in Section 3. The details are left
for the interested reader.

5  Conclusion 

The study of k-ary neural networks was justified by
the observation that they are closely related to analog
neural networks of bounded precision. We have seen
two learning results for k-ary neural networks. Firstly,
we have demonstrated a k-ary perceptron learning rule
with guaranteed convergence. Secondly, Littlestone’s
winnow algorithm, which learns binary weighted linear
threshold functions with a mistake bound dependent
on their height, has been extended to a k-ary winnow
algorithm whose mistake bound depends on the height
and width of the k-ary weighted multilinear threshold
function being learned.

Abstract

Experiments  have  revealed  that  uncon¬ 
trolled  application  of  the  analytical  learning 
paradigm  results  in  knowledge  having  low 
utility.  Because  the  performance  element 
must  consider  low  utility  knowledge  along 
with  high  utility  knowledge,  the  prolifera¬ 
tion  of  low  utility  knowledge  eventually  de¬ 
feats  the  goal  of  improved  performance.  Ex¬ 
periments  in  empirical  learning  have  demon¬ 
strated  a  similar  phenomenon.  Uncontrolled 
application  of  an  empirical  learning  paradigm 
may  result  in  inaccurate  knowledge,  and  a 
post-processing  stage  is  typically  needed  to 
repair  the  degradation  in  performance.  The 
results  from  experimentation  in  both  analyt¬ 
ical  and  empirical  learning  imply  a  general 
utility  problem  in  machine  learning.  This 
paper presents evidence for such a perspective
and recommends a closer dependence between
the learning paradigm and the performance
goals for which it is designed. A new
approach  is  presented  along  with  experimen¬ 
tation  that  illustrates  the  applicability  of  the 
approach  to  the  general  utility  problem. 

1  Introduction 

One  of  the  main  goals  for  research  in  machine  learning 
is  the  eventual  integration  of  learning  methods  with 
knowledge-based  systems.  Learning  methods  offer  the 
ability  to  transform  the  knowledge  of  the  system  to  im¬ 
prove  performance  on  the  tasks  using  the  knowledge. 
The  ability  to  transform  knowledge  will  reduce  the  de¬ 
pendency  of  system  performance  on  the  quality  of  the 
knowledge  initially  entered  by  the  knowledge  engineer. 

Recent  experimentation  with  machine  learning 
methods  has  uncovered  a  new  obstacle  to  their  inte¬ 
gration  with  knowledge-based  systems.  Experiments 
with  several  analytical  learning  systems  demonstrated 
an eventual degradation in performance due to uncontrolled
application of the learning paradigm. This obstacle
has been named the utility problem [Minton88].
Experiments  with  empirical  learning  methods  are  also 
uncovering  this  phenomenon  of  performance  degrada¬ 
tion  due  to  unconstrained  application. 


Two  additional  obstacles  block  the  integration  of 
current  learning  methods  with  knowledge-based  sys¬ 
tems.  First,  the  performance  goals  for  the  tasks  us¬ 
ing  the  knowledge  may  change  over  time.  Knowledge 
transformations  made  by  machine  learning  methods 
must  adapt  to  changes  in  the  performance  goals  for 
the  desired  tasks.  Second,  the  knowledge  may  sup¬ 
port  different  tasks  from  multiple  domains.  Knowledge 
transformations  to  improve  performance  on  one  task 
must  preserve  the  performance  goals  of  other  tasks  us¬ 
ing  the  knowledge.  These  additional  constraints  along 
with  the  original  utility  problem  combine  to  form  the 
general  utility  problem.  The  general  utility  problem 
in  machine  learning  is  the  degradation  of  performance 
for  tasks  using  the  knowledge  due  to  the  unconstrained 
transformation  of  the  knowledge  by  machine  learning 
methods. 

Recent  solutions  to  the  utility  problem  have  gen¬ 
erally  followed  the  trend  of  applying  the  learning 
method  to  every  task  and  then  pruning  away  the 
knowledge  that  eventually  turns  out  to  degrade  perfor¬ 
mance.  This  research  offers  a  different  approach  called 
performance-driven  knowledge  transformation  that  se¬ 
lectively  applies  learning  methods  only  when  necessary 
to  achieve  a  desired  performance  goal  for  some  task. 
The  approach  acquires  knowledge  for  controlling  the 
application  of  multiple  learning  methods. 

The  next  section  reviews  work  related  to  the  utility 
problem  in  analytical  learning  and  casts  recent  work 
in  empirical  learning  in  the  context  of  the  utility  prob¬ 
lem.  Section  3  defines  the  general  utility  problem  for 
machine  learning  and  discusses  alternative  solutions. 
Section  4  describes  the  performance-driven  knowledge 
transformation  approach  to  solving  the  general  utility 
problem.  Section  5  illustrates  experiments  performed 
with  an  implementation  of  the  approach  in  the  PEAK 
system.  Section  6  concludes  with  plans  for  further  im¬ 
provements  to  the  proposed  approach. 

2  Utility  Problem  in  Learning 

Research  in  both  empirical  and  analytical  learning  has 
uncovered  deficiencies  in  the  employed  methodologies. 
The  major  deficiencies  stem  from  the  naive  view  that 
the  methodology  in  question  is  always  applicable  to 
the learning task and therefore should always be applied
to the data. In a performance-driven system, one
methodology is rarely sufficient to handle the variety
of learning tasks.

2.1  Analytical  Learning 

Research  on  analytical  (explanation-based)  learning 
techniques  began  to  focus  more  attention  on  perfor¬ 
mance  with  the  appearance  of  Keller’s  work  on  the 
definition of operationality [Keller88]. Analytical techniques
learn from a single example by proving the example
is an instance of the concept to be learned. The
proof  terminates  when  the  leaves  of  the  proof  tree  are 
all  operational  predicates.  The  proof  tree  is  then  gen¬ 
eralized,  yielding  an  operational  description  of  the  con¬ 
cept.  Earlier  work  on  explanation-based  learning  de¬ 
fined  an  operational  concept  as  one  whose  description 
is  composed  from  a  set  of  predicates  deemed  easy  to 
evaluate [DeJong86, Mitchell86]. Keller pointed out
that  operationality  is  more  intimately  related  to  the 
performance  element  and  the  desired  performance  im¬ 
provement.  This  view  of  operationality  was  used  in 
the  MetaLex  system  that  learns  heuristics  for  solv¬ 
ing  calculus  problems.  MetaLex  defines  an  opera¬ 
tional  concept  as  one  that  improves  the  performance 
element’s  (problem  solver’s)  run-time  efficiency  on  a 
set  of  benchmark  calculus  problems,  while  maintain¬ 
ing  effectiveness  so  that  some  percentage  of  the  prob¬ 
lems  are  still  solved  correctly.  The  increased  attention 
on  performance  has  led  to  the  reevaluation  of  several 
analytical learning systems and the observation that
performance  may  degrade  with  repeated  application. 

2.1.1  PRODIGY 

In  experimentation  with  the  Morris  analytical  learn¬ 
ing  system,  Minton  found  that  performance  degrades 
as  the  number  of  rules  grows  large  [MintonSS].  In  order 
to  learn  a  concept,  the  system  acquires  several  rules 
whose  disjunction  forms  the  system’s  understanding 
of the concept. As the number of rules increases, the
cost of determining the applicability of a rule may
outweigh the benefits of applying, and thus retaining,
the rule. Minton calls this phenomenon the utility
problem and offers the Prodigy system as a solution
[Minton88].  Prodigy  maintains  empirical  estimates 
of  match  costs,  application  savings  and  frequency  of 
application  for  each  rule.  These  estimates  are  used 
to  compute  a  utility  value  for  the  rule.  If  this  value 
becomes  negative,  the  rule  is  no  longer  considered. 
Minton  found  that  maintenance  of  a  rule’s  utility  value 
and  compression  of  the  rule’s  conditions  result  in  a 
substantial  performance  improvement.  These  results 
indicate  that  a  system  should  be  sensitive  to  the  cost 
and  savings  of  the  learned  descriptions. 
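
The utility computation described above can be sketched as follows. The formula used here (average savings times application frequency, minus average match cost) is the form commonly attributed to Prodigy. The class name, field names, and the way frequency is estimated are hypothetical choices for this sketch, not details taken from the system.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RuleStats:
    """Running estimates kept for one learned rule (field names are illustrative)."""
    avg_match_cost: float  # average cost of testing the rule's conditions
    avg_savings: float     # average search effort saved when the rule applies
    applications: int      # number of times the rule actually applied
    match_attempts: int    # number of times the rule was tested

    def utility(self) -> float:
        # Application frequency is estimated as the fraction of successful matches.
        freq = self.applications / self.match_attempts if self.match_attempts else 0.0
        return self.avg_savings * freq - self.avg_match_cost


def prune(rules: Dict[str, RuleStats]) -> Dict[str, RuleStats]:
    """Keep only the rules whose estimated utility is not negative."""
    return {name: stats for name, stats in rules.items() if stats.utility() >= 0.0}
```

A rule that saves a great deal but rarely applies can still have negative utility, because its match cost is paid on every attempt.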

2.1.2  SOAR 

Experimentation  on  the  Soar  system  has  uncovered 
similar  results  [Tambe88].  Instead  of  monitoring  the 
cost and benefits of rules, Tambe and Rosenbloom restrict
the expressiveness of the learned rules so that the


complexity  of  the  match  is  kept  linear  in  the  number  of 
matching  conditions  [Tambe89].  Results  of  using  this 
technique  within  SOAR  indicate  a  greater  number  of 
less  expressive  rules  are  needed  to  attain  the  generality 
of the more expressive rules, but the match cost is no
longer  exponential.  However,  the  results  are  unclear 
on  whether  an  exponential  number  of  simpler  rules  will 
be  needed  to  achieve  the  generality  of  the  more  expres¬ 
sive  rules.  Also,  the  trend  toward  generating  ground 
instances  of  the  general  rules  seems  contradictory  to 
the  purported  benefits  of  analytical  learning. 

2.1.3  EGGS 

Despite  the  aforementioned  evidence  for  degrading 
performance  in  analytical  learning  systems,  other  such 
systems  have  demonstrated  improved  performance 
without concern for the number or form of the learned
rules. Looking at systems by O’Rorke [O’Rorke87]
and Shavlik [Shavlik88], Mooney recently uncovered
the  reason  for  the  contradictory  results  [Mooney89]. 
The  performance  element  for  Mooney’s  experiments 
was  Eggs  [Mooney86],  which  includes  a  Horn-clause 
theorem  prover  and  standard  explanation-based  learn¬ 
ing  techniques  [DeJong86,  Mitchell86]  for  generalizing 
the  proofs. 

Experiments  with  Eggs  revealed  that  limited  use 
of  the  learned  rules  provided  greater  performance  in 
accuracy and speed than full use. Because Shavlik
constrained  the  proofs  to  be  no  longer  than  a  specified 
depth  bound,  his  system  was  making  only  limited  use 
of  the  learned  rules  (i.e.,  only  those  rules  that  required 
limited  chaining). 

Mooney  also  demonstrated  that  using  a  breadth- 
first  search  for  theorem  proving,  instead  of  depth-first, 
also  forced  a  limited  use  of  learned  rules.  Learned  rules 
that  would  have  required  deep  sub-goaling  to  reach  a 
solution  are  circumvented  by  the  simultaneous  con¬ 
sideration  of  proofs  from  the  original  domain  theory. 
The  use  of  breadth-first  search  in  O’Rorke’s  system  ac¬ 
counts  for  much  of  the  favorable  performance.  Mooney 
concludes  that  limited  use  of  learned  rules  is  advis¬ 
able  until  the  system  has  learned  the  rules  necessary 
to  solve  the  more  common  problems. 

2.1.4  Summary 

Experimentation  on  analytical  learning  systems 
demonstrates  performance  degradation  with  uncon¬ 
trolled  application  of  the  paradigm.  In  response,  many 
researchers  have  opted  for  more  specific  instances  of 
the  learned  rules.  The  level  of  specificity  of  the 
knowledge  should  not  be  arbitrary,  but  determined  by 
the  desired  performance.  In  fact,  with  performance- 
directed  learning  the  original  domain  theory  may  per¬ 
form  within  desired  performance  thresholds,  in  which 
case, the application of analytical learning may be unnecessary.

2.2 Empirical Learning

Empirical learning methods have traditionally been
designed  to  achieve  the  best  classification  performance 
possible.  However,  experimentation  described  below 
indicates  that  classification  performance  can  actually 
degrade  with  repeated  application  of  the  empirical 
learning  method. 

2.2.1  AQ 

During  experimentation  with  the  AQ  system  (specif¬ 
ically,  AQ16  [Michalski86]),  Michalski  found  that 
repetitive  application  of  AQ  can  yield  less  accurate 
concepts  than  a  more  conservative  application  strategy 
combined  with  a  simple  inference  mechanism  [Michal- 
ski87].  The  AQ  methodology  finds  a  conjunctive  de¬ 
scription  that  covers  as  many  positive  examples  as  pos¬ 
sible  without  covering  any  negative  examples.  Positive 
examples not covered by the first description are used
as input for another execution of AQ. This procedure
continues until a concept in disjunctive normal form is
produced covering all the positive examples and none
of the negative examples. Michalski compared the accuracy
of the DNF concept with that of the concept
consisting of only the single disjunct covering the most
positive  examples.  Using  a  simple  matching  proce¬ 
dure,  the  truncated  concepts  out-performed  the  orig¬ 
inal  concepts  in  both  accuracy  and  speed.  This  ob¬ 
servation  illustrates  the  need  for  systems  to  be  more 
selective  in  their  own  behavior  when  such  selectivity  is 
sufficient  to  achieve  the  desired  performance  goals. 
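
The truncation step described above can be sketched as follows. The sketch does not implement AQ's search for conjunctive descriptions, nor Michalski's matching procedure. It only shows how an already-learned DNF concept might be reduced to the single disjunct that covers the most positive examples. The representation of a conjunct as a dictionary of attribute values, and all function names, are assumptions made for the example.

```python
def covers(conjunct, example):
    """A conjunct covers an example when every attribute test in it is satisfied."""
    return all(example.get(attr) == value for attr, value in conjunct.items())


def matches_dnf(dnf, example):
    """A DNF concept matches when at least one of its disjuncts covers the example."""
    return any(covers(conjunct, example) for conjunct in dnf)


def truncate(dnf, positives):
    """Return a one-disjunct concept: the disjunct covering the most positive examples."""
    best = max(dnf, key=lambda c: sum(covers(c, ex) for ex in positives))
    return [best]
```

The truncated concept is cheaper to match because only one conjunct is tested, at the cost of leaving some training positives uncovered.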

2.2.2 ID3

Similar  results  have  been  obtained  with  the  decision 
trees generated by Quinlan’s ID3 program [Quinlan86].
Quinlan  found  that  pruning  the  rules  extracted  from 
a  decision  tree  can  improve  the  accuracy  of  the  rules 
on unseen examples [Quinlan87]. ID3 builds decision
trees  by  selecting  an  attribute  from  the  training  ex¬ 
amples  providing  the  best  split  (according  to  an  infor¬ 
mation  theoretic  criterion)  between  positive  and  neg¬ 
ative  examples.  The  program  continues  by  descending 
each  branch  and  recursively  applying  itself  to  the  ex¬ 
amples  satisfying  the  attribute  value  for  that  branch. 
ID3 halts when all the nodes at the frontier of the
tree contain all positive or all negative examples. The
pruning stage removes rules from the decision tree until
accuracy on a set of test examples begins to decrease.
Compared to the original rules, the pruned
rules performed better on a set of unseen test examples.
Although the success of pruning is due to noise,
missing  values,  and  the  decreased  number  of  examples 
available  at  higher  depths  in  the  tree,  this  stage  might 
have  been  unnecessary  if  the  desired  accuracy  had  been 
taken  into  account  during  the  initial  generation  of  the 
decision  rules. 
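
The attribute selection step and the idea of stopping at a desired accuracy can be sketched as follows. Information gain is used as the information theoretic criterion, which is the choice usually associated with ID3. The accuracy-based halting test is a hypothetical illustration of the suggestion made above and is not part of ID3 itself. All function and attribute names are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(examples, labels, attribute):
    """Entropy reduction obtained by splitting on one attribute.

    examples is a list of dicts mapping attribute names to values.
    """
    partitions = {}
    for example, label in zip(examples, labels):
        partitions.setdefault(example[attribute], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def best_attribute(examples, labels, attributes):
    """Pick the attribute whose split yields the largest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, labels, a))


def leaf_is_accurate_enough(labels, desired_accuracy=1.0):
    """Halting test: with 1.0 a node must be pure; lower values stop earlier."""
    majority = Counter(labels).most_common(1)[0][1]
    return majority / len(labels) >= desired_accuracy
```

With the default threshold of 1.0 the halting test reproduces the pure-leaf condition; passing the desired correctness level instead would stop tree growth as soon as a leaf is accurate enough.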

2.2.3  Summary 

Research  on  empirical  learning  has  shown  that  inexact 
rules  combined  with  a  simple  matching  procedure  can 


be  less  expensive  and  more  accurate  than  exact  rules. 
The tradeoff between accuracy and completeness of the
learned  rules  should  be  decided  by  the  desired  perfor¬ 
mance.  Modifying  the  AQ  algorithm  to  return  only 
the  one  disjunct  covering  the  most  positive  examples 
may avoid generation of low utility knowledge. Likewise,
constraining the ID3 algorithm to stop when the
leaves  have  reached  a  desired  level  of  accuracy  may 
prevent  inaccuracies  at  greater  depths  in  the  decision 
tree  and  avoid  the  need  for  pruning. 

Typically,  empirical  learning  methods  are  invoked 
to  achieve  the  best  classification  accuracy  possible.  To 
avoid  a  degradation  in  classification  performance,  em¬ 
pirical  learning  methods  should  be  invoked  only  when 
necessary  to  achieve  a  violated  performance  goal.  Fur¬ 
thermore,  repair  of  the  performance  goal  violation  may 
require  only  modest  generalization  as  opposed  to  the 
large  inductive  leaps  made  by  most  empirical  learning 
methods.  In  the  extreme  case  (e.g.,  small  number  of 
instances  in  the  concept),  rote  learning  may  be  prefer¬ 
able  to  more  powerful  empirical  learning  methods. 

3  General  Utility  Problem 

The  generation  of  low  utility  knowledge  by  both  an¬ 
alytical  and  empirical  learning  methods  indicates  a 
utility  problem  in  machine  learning  more  widespread 
than  that  identified  in  the  analytical  learning  litera¬ 
ture.  The  general  utility  problem  encompasses  not 
only  the  performance  degradation  on  one  task  due 
to  uncontrolled  application  of  learning  methods,  but 
also  adaptation  to  changing  task  performance  goals 
and  maintenance  of  performance  on  other  tasks  using 
the  knowledge.  Thus,  the  general  utility  problem  is 
informally  defined  as  follows: 

General  Utility  Problem:  performance 
degradation  on  one  or  more  tasks  due  to  the 
transformation  of  knowledge. 

In  order  to  address  the  general  utility  problem,  the 
role  of  machine  learning  methods  must  be  viewed  from 
a  purely  performance-based  perspective.  This  perspec¬ 
tive  is  similar  to  that  used  by  Keller  in  the  MetaLex 
system  [Keller87].  MetaLex  transforms  its  knowl¬ 
edge  base  by  retaining  a  new  concept  only  if  the  con¬ 
cept  improves  the  efficiency  and  effectiveness  of  the 
performance  element. 

Markovitch  and  Scott  address  the  general  utility 
problem by filtering the information flow from instances,
to the knowledge base, and then to the performance
element [Markovitch89]. One filter, the utilization
filter, removes harmful rules from the knowledge
used  by  the  performance  element. 

Learning control knowledge for the application of
learning methods to different tasks is addressed by
Rendell’s variable-bias management system (VBMS)
[Rendell87]. VBMS maps different tasks to points in
a bias space. Each point in bias space represents a
choice of inductive algorithm, representation language,
and any relevant parameters for the algorithm or language.

Figure 1: Performance Spaces for Two Tasks

The  next  section  describes  an  approach  to  the  gen¬ 
eral  utility  problem  called  performance-driven  knowl¬ 
edge  transformation.  This  approach  differs  from  the 
approach  in  MetaLex  by  making  performance  goals 
more  explicit  and  incrementally  adapting  to  changes  in 
desired  performance.  In  contrast  to  the  knowledge  fil¬ 
tering  approach,  performance-driven  knowledge  trans¬ 
formation  constrains  the  initial  generation  of  knowl¬ 
edge  as  directed  by  failure  performance  goals.  This 
approach  differs  from  the  approach  in  VBMS  in  that 
the  emphasis  is  on  selecting  learning  methods  to  re¬ 
pair  violations  in  desired  performance,  not  to  achieve 
maximum  possible  performance  on  an  isolated  learn¬ 
ing  task. 

4  Performance-Driven  Knowledge 
Transformation 

Performance-driven  knowledge  transformation  con¬ 
trols  the  application  of  learning  methods  based  on 
their  ability  to  achieve  desired  performance  goals  on 
one  task  while  preserving  the  performance  on  other 
tasks.  Each  task  for  the  knowledge  base  defines  a  per¬ 
formance  space.  The  dimensions  of  the  performance 
space  are  the  performance  goals  (e.g.,  completeness, 
correctness,  response  time)  to  be  maintained  by  the 
knowledge base for that task. The current state of the
knowledge  base  is  represented  by  a  point  in  the  per¬ 
formance  space  for  each  task.  A  knowledge  transfor¬ 
mation  can  be  viewed  as  a  move  of  the  current  knowl¬ 
edge  base  from  one  point  in  the  performance  space  of 
each  task  to  another.  Figure  1  shows  the  performance 
spaces for two tasks. Task A (Figure 1a) consists of
three performance goals G1, G2, and G3. Task B (Figure
1b) consists of two performance goals G4 and G5.
The locations of two knowledge bases K1 and K2 are
shown for each task.

The  desired  performance  for  each  task  defines  a 
hyper-rectangle  in  that  task’s  performance  space. 
When  the  knowledge  base  moves  outside  the  desired- 
performance  hyper-rectangle  in  some  performance 


space,  performance-driven  knowledge  transformation 
selects  a  learning  method  to  transform  the  knowledge 
base  so  that  the  corresponding  point  in  the  perfor¬ 
mance  space  for  the  current  task  moves  back  inside 
the  desired-performance  hyper-rectangle  without  mov¬ 
ing  the  point  outside  the  desired  hyper-rectangle  in  the 
performance spaces for other tasks. Referring to Figure
1, knowledge base K1 has satisfactory performance
for task B, but violates the performance goals of task
A. Transforming knowledge base K1 to K2 achieves
the performance goals of task A and preserves the satisfactory
performance for task B.
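
The hyper-rectangle test described above can be sketched as follows. The sketch assumes that each performance goal is given as a closed interval of acceptable values and that a knowledge base is summarized by one measured value per goal. These representations, the function names, and the numbers in the usage note are hypothetical.

```python
from typing import Dict, Tuple

Bounds = Dict[str, Tuple[float, float]]  # goal name -> (lowest, highest) acceptable value
Point = Dict[str, float]                 # goal name -> measured value


def inside(point: Point, desired: Bounds) -> bool:
    """True when the point lies within the desired-performance hyper-rectangle."""
    return all(low <= point[goal] <= high for goal, (low, high) in desired.items())


def acceptable_move(task: str,
                    old: Dict[str, Point],
                    new: Dict[str, Point],
                    desired: Dict[str, Bounds]) -> bool:
    """Check a knowledge transformation, with points keyed by task name.

    The repaired task must end up inside its rectangle, and every other task
    that was inside its rectangle before the move must remain inside.
    """
    if not inside(new[task], desired[task]):
        return False
    return all(inside(new[t], desired[t])
               for t in desired
               if t != task and inside(old[t], desired[t]))
```

For example, with `desired = {"A": {"G1": (0.9, 1.0)}, "B": {"G4": (0.8, 1.0)}}`, a move from `{"A": {"G1": 0.7}, "B": {"G4": 0.9}}` to `{"A": {"G1": 0.92}, "B": {"G4": 0.85}}` is acceptable for task A, mirroring the transformation from K1 to K2 in Figure 1.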

This  research  proposes  an  approach  to  performance- 
driven  knowledge  transformation  implemented  in  the 
Peak  system.  When  a  performance  goal  violation  is 
detected  while  solving  a  problem  from  some  task,  the 
Peak  system  uses  information  about  the  context  of 
the  goal  violation  (e.g.,  the  difference  between  desired 
and  actual  performance)  to  select  a  transformation  op¬ 
erator  for  reducing  this  difference  while  maintaining 
other  performance  levels.  Application  of  the  opera¬ 
tor  yields  a  new  knowledge  base.  If  the  new  knowl¬ 
edge  base  achieves  the  violated  performance  goal  and 
preserves  other  performance  goals,  then  the  current 
knowledge  base  is  replaced  by  the  new  knowledge  base. 
Otherwise,  another  transformation  operator  is  selected 
for  application.  Verification  of  the  new  knowledge  base 
is  accomplished  by  using  the  knowledge  to  solve  previ¬ 
ously  seen  problems  from  the  same  task.  For  each  op¬ 
erator,  Peak  retains  information  about  the  applicabil¬ 
ity  of  the  operator  in  a  given  context  based  on  the  suc¬ 
cess  of  the  operator  in  reducing  the  goal  violation.  As 
more  performance  goal  violations  are  repaired,  Peak 
demonstrates  more  intelligent  selection  of  transforma¬ 
tion  operators  and  quicker  convergence  to  a  knowledge 
base  within  desired  performance  thresholds. 
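
The repair cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Peak implementation. Operators are modelled as functions from a knowledge base to a new knowledge base, verification is a caller-supplied predicate, and applicability is a simple running score per operator and violation context. All names are hypothetical.

```python
def repair(kb, operators, scores, goals_met, context):
    """Try transformation operators until one yields a verified knowledge base.

    kb         -- current knowledge base (any object)
    operators  -- dict: operator name -> function(kb) returning a new knowledge base
    scores     -- dict: (operator name, context) -> applicability estimate, updated in place
    goals_met  -- function(kb) -> bool; stands in for re-solving previously seen queries
    context    -- hashable description of the goal violation
    """
    # Operators that succeeded before in this context are tried first.
    ranked = sorted(operators, key=lambda name: scores.get((name, context), 0.0),
                    reverse=True)
    for name in ranked:
        candidate = operators[name](kb)
        succeeded = goals_met(candidate)
        key = (name, context)
        scores[key] = scores.get(key, 0.0) + (1.0 if succeeded else -1.0)
        if succeeded:
            return candidate, name  # the new knowledge base replaces the old one
    return kb, None                 # no operator repaired the violation
```

Because the scores persist across calls, an operator that repaired an earlier violation in the same context is selected first the next time, so fewer candidates need to be verified.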

In  the  following  discussion,  certain  assumptions 
have  been  made  about  the  knowledge  in  the  knowledge 
base  and  the  performance  element  using  this  knowl¬ 
edge.  The  knowledge  base  is  a  set  of  Horn  clause  rules. 
The  performance  element  is  a  deductive  retriever  sim¬ 
ilar  to  Prolog.  Performance  is  measured  while  the  per¬ 
formance  element  attempts  to  solve  a  query  posed  by 
the  user.  Attached  to  the  query  are  the  performance 
goals  to  be  maintained  during  the  solution  of  the  query. 
Performance  goal  violations  occur  when  the  measured 
performance  exceeds  the  desired  thresholds. 

4.1  Performance  Perspective 

Using  performance  goals  as  a  means  of  guiding  the 
maintenance  and  repair  of  a  knowledge  base  requires 
a  precise  definition  of  performance.  The  definition  of 
performance  depends  on  the  perspective.  Four  per¬ 
spectives  are  applicable  for  describing  the  performance 
of  a  knowledge  base: 

•  External  performance  is  the  performance  mea¬ 
sured  from  outside  the  knowledge  base,  regardless 
of any internal knowledge transformations.

• Current performance is the performance the system
currently maintains for the previously seen
queries.

•  Expected  performance  is  the  performance  the 
system  expects  to  demonstrate  on  future  queries. 
Expected  performance  is  usually  the  same  as  cur¬ 
rent  performance. 

•  Absolute  performance  is  the  performance  that 
the  current  state  of  the  knowledge  would  support 
if  given  every  possible  query. 

When  the  user  specifies  a  threshold  for  some  per¬ 
formance  measure,  the  proper  perspective  must  be 
used  to  evaluate  the  performance  of  the  knowledge 
base. Absolute performance is rarely available due to
a lack of knowledge about the instance space. Absolute
performance is also inappropriate, because the distribution
over the entire instance space may not give
equal  probability  to  each  instance.  External  perfor¬ 
mance  provides  information  about  the  rate  of  conver¬ 
gence  towards  absolute  performance.  Changes  in  ex¬ 
ternal  performance  indicate  the  need  for  an  increase 
or  decrease  in  the  extent  of  the  knowledge  transforma¬ 
tions.  Current  performance  evaluates  the  knowledge 
only  on  previously  seen  queries.  Expected  performance 
is  the  best  measure  of  the  current  state  of  the  knowl¬ 
edge  base,  because  the  objective  of  the  knowledge  base 
is  to  maintain  its  expected  ability  to  perform  the  task 
within  desired  thresholds  on  possibly  unseen  queries. 

Both expected and external performance should be measured
by the performance-driven knowledge transformation
process.  Knowledge  transformations  are  triggered  only 
when  expected  performance  falls  below  desired  levels. 
External  performance  should  then  be  used  in  the  selec¬ 
tion  of  an  appropriate  transformation  operator.  The 
greater  the  difference  between  external  and  expected 
performance,  the  more  drastic  a  transformation  oper¬ 
ator  should  be  recommended  by  the  system. 

4.2  Information  on  Goal  Violations 

Once  a  goal  violation  has  been  detected,  several  pieces 
of  information  are  available  for  selecting  an  appropri¬ 
ate  knowledge  transformation  operator.  First,  as  de¬ 
scribed  in  the  previous  section,  the  difference  between 
expected  and  external  performance  indicates  the  ex¬ 
tent  of  the  necessary  transformation. 

Second,  after  the  performance  element  attempts  to 
solve  a  query,  the  violated  and  preserved  goals  are 
known. Each goal contains information about the performance
measure that this goal constrains, the desired
threshold on the measure, the observed value of
the  measure  on  previously  seen  queries  (including  the 
query  just  processed),  and  the  difference  between  the 
observed  and  desired  performance  (the  error).  The 
performance  measure  constrained  by  a  violated  goal  is 
useful  for  selecting  transformation  operators  capable  of 
improving this performance measure. The magnitude


of  the  error  indicates  the  extent  of  the  transformation. 
The  performance  measure  constrained  by  a  satisfied 
goal  is  useful  for  selecting  transformation  operators 
capable  of  preserving  this  performance  measure.  The 
magnitude of the error indicates the extent to which
the  selected  operator  may  degrade  performance  on  the 
satisfied  goals  in  order  to  achieve  performance  on  the 
violated  goals. 

A  third  source  of  information  that  will  be  avail¬ 
able  upon  detection  of  a  performance  goal  violation 
is  the  task  history.  Each  task  known  to  the  knowledge 
base  maintains  a  task  history  of  previously  seen  queries 
from  the  task.  The  task  history  serves  two  purposes. 
First,  the  task  history  represents  an  empirical  estimate 
of  the  distribution  over  the  possible  queries  of  the  task. 
This  distribution  can  be  used  to  verify  the  achievement 
of violated performance goals in transformed knowledge.
Second, an entry in the task history contains information
about the query-solving episode. One useful
piece  of  information  about  a  query-solving  episode  is 
the  trace  of  the  knowledge  accessed  during  the  solu¬ 
tion. 

The  knowledge  trace  is  an  and/or  tree  that  records 
the  knowledge  accessed  during  the  solution  of  the 
query  and  indicates  which  rules  (if  any)  support  the 
response  to  the  query.  Information  about  the  shape 
of  a  task’s  knowledge  traces  constrains  the  selection  of 
knowledge  transformations.  For  example,  wide,  shal¬ 
low  knowledge  traces  indicate  that  the  knowledge  con¬ 
sists  of  specific  instances  of  the  task;  whereas  narrow, 
deep  knowledge  traces  indicate  a  more  general  set  of 
rules for proving queries from the corresponding task.
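
The shape of a knowledge trace can be summarized by its depth and its greatest width, as in the following sketch. The representation of a trace node as a dictionary with a list of children is an assumption made for the example, and the distinction between and-nodes and or-nodes is ignored.

```python
def depth(node):
    """Number of levels in the trace, counting the root as level one."""
    return 1 + max((depth(child) for child in node["children"]), default=0)


def max_width(node):
    """Largest number of nodes found on any single level of the trace."""
    level, widest = [node], 1
    while level:
        widest = max(widest, len(level))
        level = [child for n in level for child in n["children"]]
    return widest
```

A trace with small depth and large width would correspond to the wide, shallow case above, and a trace with large depth and width near one to the narrow, deep case.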

Finally, past success of the transformation operators
provides information upon performance goal violation.
As the knowledge base transforms to meet
performance  goals,  a  record  is  kept  of  the  old  and  new 
knowledge  bases  along  with  the  operator  responsible 
for  the  transformation.  If  the  new  knowledge  base 
achieves a violated goal while preserving non-violated
goals, then the system increases the operator’s applicability
for achieving and preserving the appropriate
goals. Over time, collection of this information will
allow  the  system  to  make  a  more  informed  operator 
selection  based  on  past  experience. 

4.3  Verification  of  Knowledge  Base 

Because  no  operator  application  is  guaranteed  to 
achieve  the  desired  results,  the  system  must  verify 
that  the  knowledge  base  resulting  from  an  operator 
application achieves the desired performance. Verification
can be accomplished by re-solving the queries
in the task history. The size of the task history can be
changed to trade off performance convergence rates for
transformation  speed.  As  the  system  learns  operator 
applicability, there is less chance that several alternative
operator applications must be tried before finding
one that achieves the violated goal. Because each
transformation attempt requires verification of the resulting
knowledge base, fewer necessary attempts imply
fewer verifications; thus, the task history size
can be increased over time.

5  Experimentation 

This section illustrates the application of Peak on two
tasks  from  diverse  domains.  The  first  experiment  in¬ 
volves  learning  to  improve  response  time,  completeness 
and  correctness  while  determining  whether  to  land  the 
space  shuttle  manually  or  automatically  depending  on 
environmental  conditions.  The  second  experiment  in¬ 
volves  learning  to  improve  response  time  while  con¬ 
structing  plans  to  build  towers  in  the  blocks-world  do¬ 
main.  Together,  the  two  experiments  demonstrate  the 
ability  of  performance-drive  knowledge  transformation 
to  selectively  apply  appropriate  learning  methods  to 
achieve  desired  performance  goals. 

5.1  Shuttle  Domain 

This  experiment  executes  the  Peak  system  on  the 
shuttle  landing  control  database  available  from  the 
machine  learning  databases  maintained  by  University 
of  California  at  Irvine.  The  problem  is  to  determine 
whether  to  land  the  shuttle  manually  or  automatically 
based  on  environmental  attributes.  The  corresponding 
task is labeled the landing task, and the queries are
of the form landing(ENV, ?x). The ENV in the query
represents  the  environmental  situation  to  be  evalu¬ 
ated.  The  performance  element  attempts  to  fill  in  the 
?x  with  the  recommended  landing  control:  auto  or 
noauto. 

Prior to query answering, the user inputs the performance
thresholds to be maintained by the knowledge
base while answering landing queries using the
performance element (a backward-chaining deductive
theorem prover for Horn clauses). For this experiment,
three performance goals are specified: correctness,
completeness and response time. The correctness
goal specifies that the answers to queries must be correct
90% of the time. The completeness goal specifies
that the query must be answered 95% of the time.
That is, the answer should be either auto or noauto
and not “I don’t know”. The response time goal specifies
that the performance element must respond within
10 seconds.
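
The three goals can be made concrete with a short sketch. It assumes, for illustration only, that each answered query is logged as a small record, that correctness is computed over answered queries only (as the discussion of Figure 2c states), and that the response time goal is checked against the most recent query; the record layout and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryResult:
    answer: Optional[str]  # None stands for "I don't know"
    expected: str          # the correct landing control
    seconds: float         # time taken by the performance element


def completeness(results):
    """Fraction of queries that received any answer."""
    return sum(r.answer is not None for r in results) / len(results)


def correctness(results):
    """Fraction of answered queries whose answer was correct."""
    answered = [r for r in results if r.answer is not None]
    if not answered:
        return 1.0  # assumed: nothing answered means nothing answered wrongly
    return sum(r.answer == r.expected for r in answered) / len(answered)


def violated_goals(results, min_correct=0.90, min_complete=0.95,
                   max_seconds=10.0):
    """List the performance goals currently violated."""
    violations = []
    if correctness(results) < min_correct:
        violations.append("correctness")
    if completeness(results) < min_complete:
        violations.append("completeness")
    if results[-1].seconds > max_seconds:
        violations.append("response_time")
    return violations
```

Under these assumptions an empty or mostly unanswered history reports 100% correctness, which matches the early behavior described for Figure 2c.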

Two knowledge transformation operators are available:
rote learning and empirical learning. Application
of the rote learning operator asks the user for the correct
answer to the query. A new rule is added to the
knowledge base having the instantiated query as the
consequent, and the facts defined before query execution
as the antecedent. The empirical learning operator
utilizes the ID3 program to build a decision tree
from examples in the knowledge base. The examples
are rules such as those learned by the rote operator.
Each path in the resulting decision tree is converted
to a rule. The examples are replaced by the new rules
in the transformed knowledge base.
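
The conversion of decision-tree paths into rules can be sketched as follows. The tree encoding (a class label string at each leaf, an `(attribute, branches)` pair at each internal node) is an assumption made for this example, and the tree shown is a hand-written toy, not output from ID3.

```python
def tree_to_rules(tree, conditions=()):
    """Turn every root-to-leaf path into a (label, conditions) rule."""
    if isinstance(tree, str):
        # Leaf: the conditions collected on the way down form the antecedent.
        return [(tree, list(conditions))]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(
            tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


# Hypothetical toy tree over two shuttle attributes.
toy_tree = ("visibility", {
    "no": "auto",
    "yes": ("error", {"XL": "noauto", "MM": "auto"}),
})

for label, antecedent in tree_to_rules(toy_tree):
    body = " & ".join(f"{attr}(x,{val})" for attr, val in antecedent)
    print(f"landing(x,{label}) <- {body}")
```

Each printed line has the same shape as the general rules that replace the rote-learned examples in the transformed knowledge base.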


Starting with an empty knowledge base, Peak attempts
to solve landing queries, while maintaining the
performance  goals.  Figure  2  plots  the  three  perfor¬ 
mance  goals  for  200  randomly  chosen  queries  from  the 
shuttle  landing  control  domain. 

Figure  2a  illustrates  how  PEAK  maintains  response 
time  performance  below  10  seconds.  For  the  first  30 
queries,  response  time  increases  as  the  number  of  rote- 
learned  rules  increases.  Eventually,  the  large  number 
of  rules  in  the  knowledge  base  cannot  be  traversed 
within  the  response  time  threshold. 

While processing the 30th query, Peak was unable
to solve the query, generating a completeness failure.
Peak first tried to transform the knowledge base by
rote-learning a new rule. However, verification of the
new knowledge base uncovered a response time failure.
Because the rote learning operator was ineffective,
Peak chose to apply the ID3 operator. ID3 generalized
the 29 learned instances into 8 general rules.
As Figure 2a indicates, the resulting transformation
drastically improved response time performance.

The  plot  of  completeness  performance  in  Figure  2b 
illustrates  how  Peak  quickly  learns  the  initial  query 
knowledge.  After  the  ID3  transformation,  complete¬ 
ness  remained  above  the  95%  threshold  for  the  remain¬ 
der  of  the  200  queries. 

The  correctness  plot  in  Figure  2c  shows  how  per¬ 
formance  starts  at  100%  and  converges  to  the  desired 
90%  threshold.  The  initial  values  of  100%  for  correct¬ 
ness  are  due  to  the  fact  that  many  of  the  initial  queries 
could  not  be  answered.  Correctness  performance  only 
measures  the  correctness  of  answered  queries.  Imme¬ 
diately  following  the  application  of  ID3,  correctness 
falls  to  94%  due  to  the  next  two  queries  being  incor¬ 
rectly  answered  according  to  the  new  knowledge  base. 
As  query  answering  continues,  the  over-generalization 
in  the  rules  eventually  brings  correctness  down  below 
the  90%  threshold.  Correctness  violations  occur  at 
queries 89, 98, 153 and 163. In each case, Peak uses
the  rote-learning  operator  to  memorize  the  incorrectly 
answered  query  and  restore  90%  correctness  perfor¬ 
mance. 

The  final  knowledge  base  after  completion  of  the 
200  queries  consists  of  the  12  rules  shown  in  Figure  3. 
Rules  5-12  are  the  general  rules  learned  by  ID3.  Rules 
1-4  are  the  specific  instances  learned  to  repair  the  over¬ 
generalization  in  ID3’s  rules.  After  200  queries,  the 
knowledge  base  converged  to  8  general  rules  describing 
major  trends  in  the  shuttle  landing  domain  and  four 
specific rules for special cases not handled correctly by
the  general  rules. 

One final observation from Figure 2 is the convergence
of the performance towards the desired thresholds
and not towards the maximum possible performance.
This indicates how performance-driven knowledge
transformation utilizes flexibility in one dimension
of performance to improve performance in another
dimension.

Figure 2: Plots of Performance for Shuttle Domain


1.  landing(x,noauto) ← sign(x,nn) & wind(x,head) & stability(x,xstab) & error(x,MM) &
        magnitude(x,Medium) & visibility(x,yes)

2.  landing(x,noauto) ← sign(x,pp) & wind(x,tail) & stability(x,xstab) & error(x,MM) &
        magnitude(x,Low) & visibility(x,yes)

3.  landing(x,noauto) ← sign(x,nn) & wind(x,head) & stability(x,stab) & error(x,MM) &
        magnitude(x,OutOfRange) & visibility(x,yes)

4.  landing(x,noauto) ← sign(x,nn) & wind(x,tail) & stability(x,xstab) & error(x,MM) &
        magnitude(x,Low) & visibility(x,yes)

5.  landing(x,auto) ← error(x,MM) & visibility(x,yes)

6.  landing(x,auto) ← stability(x,stab) & error(x,SS) & magnitude(x,Strong) & visibility(x,yes)

7.  landing(x,auto) ← visibility(x,no)

8.  landing(x,noauto) ← error(x,XL) & visibility(x,yes)

9.  landing(x,noauto) ← error(x,LX) & visibility(x,yes)

10. landing(x,noauto) ← stability(x,xstab) & error(x,SS) & magnitude(x,Strong) & visibility(x,yes)

11. landing(x,noauto) ← error(x,SS) & magnitude(x,OutOfRange) & visibility(x,yes)

12. landing(x,noauto) ← error(x,SS) & magnitude(x,Low) & visibility(x,yes)


Figure  3:  Shuttle  Domain  Knowledge  Base  After  200  Queries 
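
To see how the specific rules override the general ones, the knowledge base of Figure 3 can be encoded as an ordered rule list. The sketch below assumes that rules are tried in the order listed and that the first match wins; this ordering is an assumption about the theorem prover, not something the paper states, and the attribute values are transcribed from the figure.

```python
# (label, required attribute values), in the order given in Figure 3.
RULES = [
    ("noauto", {"sign": "nn", "wind": "head", "stability": "xstab",
                "error": "MM", "magnitude": "Medium", "visibility": "yes"}),
    ("noauto", {"sign": "pp", "wind": "tail", "stability": "xstab",
                "error": "MM", "magnitude": "Low", "visibility": "yes"}),
    ("noauto", {"sign": "nn", "wind": "head", "stability": "stab",
                "error": "MM", "magnitude": "OutOfRange", "visibility": "yes"}),
    ("noauto", {"sign": "nn", "wind": "tail", "stability": "xstab",
                "error": "MM", "magnitude": "Low", "visibility": "yes"}),
    ("auto", {"error": "MM", "visibility": "yes"}),
    ("auto", {"stability": "stab", "error": "SS", "magnitude": "Strong",
              "visibility": "yes"}),
    ("auto", {"visibility": "no"}),
    ("noauto", {"error": "XL", "visibility": "yes"}),
    ("noauto", {"error": "LX", "visibility": "yes"}),
    ("noauto", {"stability": "xstab", "error": "SS", "magnitude": "Strong",
                "visibility": "yes"}),
    ("noauto", {"error": "SS", "magnitude": "OutOfRange", "visibility": "yes"}),
    ("noauto", {"error": "SS", "magnitude": "Low", "visibility": "yes"}),
]


def landing(env):
    """Return the label of the first matching rule, or None if none match."""
    for label, conditions in RULES:
        if all(env.get(attr) == value for attr, value in conditions.items()):
            return label
    return None  # corresponds to an "I don't know" answer
```

Under this ordering, an environment matching rule 1 is answered noauto even though the more general rule 5 would also fire and answer auto.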

5.2  Blocks-World  Domain

In the task from the blocks-world domain, the user
asks the performance element to construct a plan for
building a tower of blocks. The queries are of the form
tower(A B C ?state), where A, B and C are blocks,
and ?state is a variable to be instantiated with the
plan for achieving the tower.

Prior  to  query  answering,  the  user  inputs  the  perfor¬ 
mance  thresholds  to  be  maintained  by  the  knowledge 
base while answering tower queries. For this experiment,
one performance goal is specified: response-time
<  10  seconds.  Performance  goals  for  completeness  and 
correctness  are  inappropriate,  because  the  domain  the¬ 
ory  is  assumed  complete  and  correct. 

In addition to the rote learning and ID3 operators
used with the first experiment, an explanation-based
generalizer, Eggs [Mooney86], is included in the
Peak system. Eggs applies standard explanation-based
techniques [DeJong86, Mitchell86] to generalize
the proofs obtained by the performance element.
When Eggs is applied to a proof, the result is a general
rule that is added to the knowledge base.


Figure  4:  Plot  of  Response  Time  for  Tower  Domain 


Starting  with  the  blocks-world  domain  theory,  Peak 
attempts  to  solve  tower  queries,  while  maintaining  the 
response  time  performance  goal.  The  initial  state  of 
the  blocks  world  contained  six  blocks.  Figure  4  shows 
the  response  time  obtained  by  Peak  for  100  semi¬ 
randomly  chosen  tower  queries.  Semi-random  means 
that the first ten queries were all towers of height two
(the  two  blocks  to  be  in  the  tower  were  chosen  ran¬ 
domly  from  the  six  blocks  in  the  initial  state).  The 
second  ten  queries  were  all  towers  of  height  three,  and 
so  on  for  the  first  50  queries.  The  second  50  queries 
repeat  the  above  sequence  to  show  the  effects  on  re¬ 
sponse  time  of  the  rules  learned  during  the  first  50 
queries. 

As shown in Figure 4, the first 20 queries (towers of
height two and three) are solved by the original domain
theory within the response time threshold. However,
the domain theory is unable to maintain the response
time performance goal while processing the 21st query
(tower of height three). At this point, ID3 cannot be
applied due to the lack of examples in the knowledge
base. The Eggs operator is chosen over rote learning
due to the knowledge trace for the query. The deep,
wide proof tree suggests that Eggs is more likely to
succeed than ID3.

Application  of  Eggs  yields  a  general  rule  that  builds 
any  tower  of  height  three  in  one  step.  Thus,  the  re¬ 
mainder  of  the  tower  queries  for  height  three  are  com¬ 
pleted  within  the  response  time  threshold.  Similar 
rules are learned for the 31st query (tower of height
four) and 41st query (tower of height five). Figure 4
shows that retrying the towers of heights two through
five (queries 51-100) results in no response time
performance violations due to the previously learned rules.

This  experiment  demonstrates  Peak’s  ability  to 
constrain  the  application  of  the  Eggs  analytical  learn¬ 
ing  algorithm.  Application  of  Eggs  was  unnecessary 
for  towers  of  height  two  and  three,  because  the  original 
domain  theory  was  able  to  solve  these  queries  within 
the  desired  performance  thresholds.  However,  the  do¬ 
main  theory  was  unable  to  support  the  desired  perfor¬ 
mance  for  towers  of  height  four,  five  and  six,  requiring 
three  applications  of  Eggs  to  learn  general  rules  for 
these  specific  cases.  As  the  performance  on  the  second 
50  queries  indicates,  the  original  domain  theory  plus 
the  three  learned  rules  was  sufficient  to  maintain  the 
desired  performance  for  the  tower-building  task. 


6  Conclusions 

In  order  to  integrate  machine  learning  methods  with 
knowledge-based  systems,  the  general  utility  problem 
in  machine  learning  must  be  addressed.  Evidence  for 
the  general  utility  problem  has  been  found  in  exper¬ 
imentation  on  both  analytical  and  empirical  learning 
methods.  Unconstrained  application  of  these  meth¬ 
ods  has  been  shown  to  degrade  the  performance  they 
were  designed  to  improve.  Learning  methods  should 
be invoked only after a performance failure has been
detected,  and  then,  only  if  the  learning  method  is  ap¬ 
plicable  to  the  properties  of  the  failure.  Furthermore, 
learning  methods  must  permit  transformation  of  the 
knowledge  to  achieve  performance  goals  without  vi¬ 
olating  the  performance  goals  associated  with  other 
tasks  using  the  knowledge.  The  learning  methods  must 
also have the ability to adapt to changing performance
goals. 

Performance-driven  knowledge  transformation  offers 
an  approach  that  addresses  the  general  utility  prob¬ 
lem.  Learning  methods  are  invoked  only  when  neces¬ 
sary  to  improve  performance,  and  in  accordance  with 
previous  success  in  repairing  the  violated  performance 
goal.  The  performance-driven  knowledge  transforma¬ 
tion approach has been implemented in the Peak system.
Experimentation with Peak demonstrates the
ability  to  control  the  application  of  learning  methods 
to  achieve  desired  performance  goals. 

More  experimentation  is  necessary  to  validate  the 
use  of  performance-driven  knowledge  transformation. 
Experiments with modified versions of the AQ and ID3
algorithms  will  indicate  the  usefulness  of  these  trans¬ 
formation  methods  to  avoid  inaccuracies  due  to  uncon¬ 
strained  application  of  the  paradigms.  Experiments 
with  the  interaction  of  multiple  learning  methods  will 
indicate  the  usefulness  of  current  control  knowledge 
and  suggest  the  need  for  other  knowledge  to  con¬ 
trol  the  performance-driven  knowledge  transformation 
process. 

Abstract

Ids  is  an  integrated  discovery  system  that 
forms  taxonomic  hierarchies,  notes  qualita¬ 
tive  relations,  and  finds  numeric  laws.  This 
paper focuses on the system’s algorithm for
numeric discovery. We describe the basic
method in terms of heuristic search through a
space of numeric terms, provide a simple
example, and show how Ids uses the algorithm
to  find  three  classes  of  numeric  relations.  We 
then  evaluate  the  system’s  law-finding  ability 
through  experiments  with  artificial  domains, 
examining  the  effect  of  system  parameters, 
noise  level,  number  of  irrelevant  terms,  and 
law  complexity  on  both  asymptotic  accuracy 
and  learning  rate.  Finally,  we  consider  the  al¬ 
gorithm’s  relation  to  other  numeric  methods 
and  outline  directions  for  future  work. 


1  Introduction 

The  discovery  of  numeric  laws  is  a  central  part  of  the 
scientific  process.  In  this  paper  we  focus  on  the  nu¬ 
meric  discovery  component  of  Ids  (Nordhausen,  1989), 
an  empirical  discovery  system.  In  the  following  sec¬ 
tions,  we  detail  the  system’s  algorithm  for  finding 
quantitative  relations,  and  then  report  experiments 
with  artificial  domains  to  evaluate  the  algorithm’s  abil¬ 
ity  to  find  laws  under  a  variety  of  conditions.  Finally, 
we  consider  some  directions  for  future  research  and  dis¬ 
cuss  the  method’s  relation  to  earlier  work.  However, 
let  us  first  briefly  describe  the  overall  system  in  which 
this  algorithm  plays  a  role. 

Ids  is  an  integrated  discovery  system  that  represents 
observations  and  laws  using  Forbus’  (1985)  qualitative 
process  formalism.  Each  observed  history  consists  of  a 
temporal  sequence  of  qualitative  states,  which  repre¬ 
sent  intervals  during  which  the  signs  of  derivatives  re¬ 
main  constant  and  during  which  the  structure  remains 
unchanged. The system employs an incremental clustering
method, similar to Lebowitz’ (1987) Unimem
and Fisher’s (1987) Cobweb, to classify new qualitative
states and incorporate them into a taxonomic
hierarchy.  IDS  also  stores  links  that  indicate  temporal 
succession  among  states,  along  with  transition  condi¬ 
tions  between  pairs  of  states.  Moreover,  the  system 
forms  generalized  links  that  connect  abstract  qualita¬ 
tive  states  in  its  hierarchy;  together  with  the  transi¬ 
tion  conditions,  these  constitute  an  important  class  of 
qualitative  laws. 

2  Numeric  Discovery  in  IDS 

In addition to the above activities, Ids searches for
numeric  laws  to  augment  its  qualitative  descriptions. 
These  may  specify  the  conditions  for  moving  from  one 
state  to  another,  a  relation  between  numeric  attributes 
within  a  given  qualitative  state,  or  a  quantitative  re¬ 
lation  between  variables  in  different  states.  Each  of 
these  cases  involves  storing  a  law  at  a  node  or  link 
in the taxonomy that summarizes information in the
children  of  that  node  or  link.  Nordhausen  (1989)  de¬ 
scribes  the  overall  system  in  detail;  in  this  paper  we 
will  limit  our  attention  to  numeric  discovery. 

2.1  The  Basic  Numeric  Discovery  Algorithm 

Ids  uses  a  single  procedure  to  find  all  forms  of  numeric 
laws.  Briefly,  whenever  the  system  adds  a  new  qualita¬ 
tive  state  S  (or  transition  link  T)  to  an  existing  node 
(or  link)  in  the  hierarchy,  it  checks  to  see  if  S  (or  T) 
obeys the laws currently stored at the node. If not,
Ids searches for new laws that cover the new child and
its  siblings,  using  the  old  law  as  a  starting  point. 

For  a  given  data  set,  the  system  attempts  to  find 
a  law  that  covers  these  data  by  conducting  a  beam 
search  through  the  space  of  numeric  terms.  More  pre¬ 
cisely,  the  search  task  can  be  stated  as: 

•  Given: a set of base terms a, b, c, ..., along with
one designated term (a) from that set;

•  Find: a term $x = a^{n_a} \cdot b^{n_b} \cdot c^{n_c} \cdots$ such that a
linear relation of the form $a = mx + n$ holds.

Ids searches from simple terms to more complex ones,
using correlation analysis (Freund & Walpole, 1980)
to direct the search process. As in Langley, Bradshaw,
and Simon’s (1983) Bacon, the basic operators
involve defining new terms as products and ratios of
existing terms. The system initially examines correlations
between the designated term and observable attributes,
uses these to select promising products and
ratios, and then recurses if it cannot find a law with
the existing terms.

Table 1. The Ids algorithm for finding numeric laws.

Variables: S is the set of base terms;
           D is the designated term;
           A is a defined term;
           C is the set of current terms;
           P and Q are sets of terms.

Find-numeric-law(D, S, C)
  Let A be the term in C that has the highest
      correlation with D.
  If the correlation between D and A is high enough,
  Then call linear regression on D and A
       to find the slope and intercept.
       Return A, the slope, and the intercept.
  Else if the maximum search depth is reached,
  Then return the empty set.
  Else let C' be Find-best-terms(D, S, C).
       Find-numeric-law(D, S, C').

Find-best-terms(D, S, C)
  Let P be the products of the terms of S and C.
  Let Q be the quotients of the terms of S and C.
  For each term A in the union of P and Q,
      Compute the correlation between D and A.
  Return the terms with the N highest correlations.

Parameters:
  Width of the beam N (memory size);
  Threshold of the correlation (accuracy);
  Maximum degree of terms (law complexity);
  Maximum search depth (when to halt).

gas(b): t(b)=21.0, p(b)=100.0, v(b)=24.46
gas(c): t(c)=22.0, p(c)=200.0, v(c)=12.27
gas(d): t(d)=23.0, p(d)=300.0, v(d)=8.21

Figure 1. A spurious relation found during the
rediscovery of the ideal gas law.


This  search  technique  has  a  semi-incremental  fla¬ 
vor.  In  cases  where  Ids  has  rejected  an  existing  law, 
there  is  no  need  to  reconsider  the  term  in  that  law  and 
those  leading  to  the  law.  Thus,  it  uses  the  old  term  as 
the starting point for the new search, saving considerable
effort over an approach that starts from scratch.
However,  this  method  does  require  that  one  store  and 
reprocess  all  the  data  that  led  to  the  rejected  law.  As 
a  result,  it  does  not  quite  fit  with  the  strict  definition 
of  an  incremental  learning  system,  though  we  hope  to 
modify  this  in  future  versions  of  Ids. 


Table 1 gives the basic algorithm for finding numeric
relations. The top-level function, find-numeric-law,
is given three arguments: the designated¹ term D, the
set of base terms S, and a set of current terms C. If
Ids is attempting to revise an existing law, C contains
only the term occurring in the right-hand side of that
law. If the system is searching for a new law, C is the
set of observable terms S.

¹ As detailed in Section 2.3, the system iterates through
multiple base terms, treating each in turn as the designated
term and searching for a law that predicts each one.

At each point in the search, Ids defines all of the
products and ratios between the terms in the set S
and those in C, but it retains only those terms having
the highest correlations with the designated term D.
These new terms become the current set C, and the
function find-numeric-law is called recursively, with
the designated term D and the base terms S remaining
the same. If any term in C has a sufficiently high
correlation with D, Ids ends the search and uses a
regression technique to find the slope and intercept of
the line relating them. The system continues in this
fashion until it finds such a linear relation or until it
exceeds the maximum search depth. If the search fails,
Ids assumes that no law covers all the observed data.
As we discuss in Section 3, the system includes four
parameters that constrain its search for numeric laws.
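
The search in Table 1 can be sketched in Python as follows. This is a simplified reading, not the Ids implementation: it ranks terms by the absolute value of the Pearson correlation, omits the maximum-degree parameter, and represents each term as a name mapped to a list of observed values. All function names and default parameter values are assumptions made for the example, which uses the four gas instances from Section 2.2.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares slope and intercept of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def find_best_terms(d, base, current, beam_width):
    """Form products and quotients of current and base terms; keep the best."""
    candidates = {}
    for c_name, c_vals in current.items():
        for b_name, b_vals in base.items():
            candidates[f"({c_name}*{b_name})"] = [
                c * b for c, b in zip(c_vals, b_vals)]
            if all(b != 0 for b in b_vals):
                candidates[f"({c_name}/{b_name})"] = [
                    c / b for c, b in zip(c_vals, b_vals)]
    ranked = sorted(candidates,
                    key=lambda name: abs(pearson(d, candidates[name])),
                    reverse=True)
    return {name: candidates[name] for name in ranked[:beam_width]}


def find_numeric_law(d, base, current, threshold=0.99, beam_width=5,
                     max_depth=12):
    """Return (term, slope, intercept), or None if the search fails."""
    best = max(current, key=lambda name: abs(pearson(d, current[name])))
    if abs(pearson(d, current[best])) >= threshold:
        slope, intercept = fit_line(current[best], d)
        return best, slope, intercept
    if max_depth == 0:
        return None
    next_terms = find_best_terms(d, base, current, beam_width)
    return find_numeric_law(d, base, next_terms, threshold, beam_width,
                            max_depth - 1)


T = [21.0, 22.0, 23.0, 24.0]
P = [100.0, 200.0, 300.0, 230.0]
V = [24.46, 12.27, 8.21, 10.74]
law = find_numeric_law(T, {"P": P, "V": V, "T": T}, {"P": P})
print(law)
```

Starting from the rejected term P, the first expansion already contains the product of P and V, whose correlation with T exceeds the assumed threshold of 0.99, so the search stops after one step.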

2.2  An  Example:  Finding  the  Ideal  Gas  Law 

As an example, let us consider how Ids rediscovers the
ideal gas law. The system receives data in the form of
states with gaseous objects at different temperatures,
pressures, and volumes. Figure 1 shows the hierarchy
after the system has processed three states, with all
instances stored under a common parent node. Given
these data, Ids finds a law relating the temperature
and the pressure, because one can express the temperature
as a linear function of the pressure. Now
the system observes a fourth instance, which it adds
as a child of Node 1 because it matches the parent
completely.² However, this new instance violates the
numeric law stored at the parent node, causing Ids to
search for a new relation that covers all four instances.

² The system does not consider whether instances satisfy
numeric laws during the clustering process.

gas(b): t(b)=21.0, p(b)=100.0, v(b)=24.46
gas(c): t(c)=22.0, p(c)=200.0, v(c)=12.27
gas(d): t(d)=23.0, p(d)=300.0, v(d)=8.21
gas(e): t(e)=24.0, p(e)=230.0, v(e)=10.74

Figure 2. A correct version of the ideal gas law.


Because the term P was used in the rejected law, Ids
calls the function find-numeric-law with {P} as the
current set C, T as the designated term, and {P, V, T}
as the set of base terms, S. In other words, Ids uses the
term P as the entry point in the search space, starting
by combining P with the terms in S to form products
and ratios such as PV, P², P/T, and P/V. Of
these new terms, PV has a high enough correlation to
end the search. Regression produces the numeric law
$T = 0.12 \times PV - 273$; this version is equivalent to the
standard form of the law, $PV = 8.32(T + 273)$. Figure
2 shows the hierarchy that emerges after this revision
is complete. As the system processes more instances
and stores them under the parent node, it finds that
they obey this new relation, so find-numeric-law is
not called again.
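
The figures in this example can be checked with a few lines of arithmetic. The sketch below uses the four instances shown in Figures 1 and 2 and an ordinary least-squares fit; the helper name `fit_line` is hypothetical. Because the tabulated volumes are rounded to two decimals, a fit to these four points alone should land close to, but not exactly on, the coefficients 0.12 and -273 quoted in the text.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


T = [21.0, 22.0, 23.0, 24.0]       # gas b, c, d, e
P = [100.0, 200.0, 300.0, 230.0]
V = [24.46, 12.27, 8.21, 10.74]

# Spurious law: the first three instances make T look linear in P.
slope_p, intercept_p = fit_line(P[:3], T[:3])
predicted_t4 = slope_p * P[3] + intercept_p
print("T from P alone for gas e:", predicted_t4, "observed:", T[3])

# Revised law: T is linear in the product PV for all four instances.
PV = [p * v for p, v in zip(P, V)]
slope_pv, intercept_pv = fit_line(PV, T)
residuals = [t - (slope_pv * x + intercept_pv) for x, t in zip(PV, T)]
print("slope:", slope_pv, "intercept:", intercept_pv)
print("largest residual:", max(abs(r) for r in residuals))
```

The first fit predicts a temperature well below the observed 24.0 for the fourth gas, which is the violation that triggers the new search; the second fit leaves only small residuals on all four instances.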

Ids also defines new terms by a second method similar
to that used in Bacon. Whenever the system finds
linear relations, it introduces the slopes and the intercepts
of these relations as new numeric terms. Then
Ids uses these new quantities to find more complex
quantitative laws at higher levels in the state hierarchy.
Consider again the discovery of the ideal gas
law. In addition to the temperature (T), the pressure
(P), and the volume (V), suppose Ids is also given the
number of moles (N) of each gas. When the system
observes gases with 1.0 mole at different temperatures,
pressures, and volumes, it finds a linear relation between
T and PV in the manner described above. At
this point, Ids defines the slope s and the intercept i
of the relation, which have values of 0.12 and −273,
respectively.

Upon processing new states with different numbers
of moles, the system calls on find-numeric-law with
{PV} as the set of current terms. In this manner, Ids
finds that T is linearly related to PV without search.
Moreover, the system treats the slope and intercept
(s and i) as higher-level dependent terms, calling on
its numeric discovery method to find a law relating
them to N. As a result, Ids finds that the intercept i is
constant regardless of the number of moles N, but that
the slope s is inversely proportional to N. The system
stores both laws at a more abstract state in the hierarchy,
giving a set of relations equivalent to the standard version of
the ideal gas law: $PV = 8.32\,N\,(T + 273)$.

2.3  Three  Uses  of  the  Basic  Algorithm 

Now  that  we  have  explained  the  basic  method  for  dis¬ 
covering  numeric  laws,  let  us  describe  how  Ids  uses 
this  algorithm  to  discover  three  different  forms  of  such 
laws.  The  system  augments  the  hierarchy  of  qualita¬ 
tive  states  with  numeric  relations  in  three  ways,  each 
of  which  serves  a  different  purpose: 

•  A law within a state relates quantities that are
constant within that state.³

•  A numeric relation on a transition link specifies
the numeric conditions for moving from the current
state to its immediate successor.

•  A numeric law between two states within the same
qualitative history relates a quantity in a successor
state (sometimes many steps ahead) to quantities
in the current state.

Recall that find-numeric-law takes as arguments the
designated term D, the set of base terms S, and the set
of  current  terms  C.  The  initial  settings  of  these  argu¬ 
ments  differ  in  the  three  forms  because  they  emphasize 
different  quantities. 

Thus,  when  Ids  encounters  the  first  case  -  finding  a 
relation  involving  some  quantity  that  occurs  within  a 
state  -  it  calls  find-numeric-law  with  that  quantity 
as  the  designated  term  D,  and  the  union  of  all  within- 
state  quantities  and  the  transition  quantities  as  the  set 
of  base  terms  S.  The  system  repeats  this  process  for 
each  quantity  in  the  state.  Nordhausen  (1989)  reports 
a number of within-state relations that Ids finds in
this  manner,  including  the  ideal  gas  law,  the  equality 
of  final  temperatures  in  heat-mixture  experiments,  and 
the  constant  ratios  of  molar  concentrations  that  occur 
in  reactions  involving  chemical  equilibrium. 

Similarly, when Ids attempts to find relations on a
transition link between two states, a transition quantity
becomes the designated term D, whereas the set of
base terms is again the union of all within-state quantities
and transition quantities. This lets the system
discover the conditions for moving between qualitative
states. Nordhausen (1989) recounts a variety of such
discoveries, involving melting and boiling points, fluid
flow phenomena, and the rates of chemical reactions.
For instance, we presented Ids with a set of three-state
histories in which two containers a and b begin
with different levels of fluid $L_a$ and $L_b$. After opening
a conduit between the containers, one level increases
and the other decreases for a time, until both quantities
stop changing simultaneously. Given these data,
the system discovers that the transition from the second
state to the third state occurs when $L_a = L_b$,
that is, when the system reaches equilibrium. Transition
laws can also involve simple constants, as in the
case of melting and boiling points. Such relations let
Ids implicitly specify inequalities, since they state that
certain changes continue to occur until the transition
conditions are met.

³ When Ids is given a qualitative history, it is told which
quantities are constant within each state and which are
changing. The current system attempts to find numeric
laws only for the former terms, but in principle the algorithm
in Table 1 could also be used to discover relations
between changing variables.

The system uses a forward propagation method in
order to find numeric laws between qualitative states.
It first attempts to relate a quantity in the current
state to the quantities in the immediate successor
state. If Ids cannot infer a law between a state and
its immediate successor, it looks for a numeric relation
between the state and the successor of the successor,
continuing this chain until it finds a relation
or it reaches a state with no successor.⁴ Nordhausen
(1989) presents numerous examples of across-state relations
that Ids discovers, including Black’s law of specific
heat, the chemical law of combining weights, and
Archimedes’ principle of displacement. For instance,
in the fluid-flow domain described above, the system
finds a numeric relation $L_f = \frac{1}{2} L_a + I$, where $L_a$ is
the initial level of container a, $L_f$ is the final level,
and I is constant for a given value of $L_b$, the initial
level of container b. Moreover, Ids finds a second law
that holds at a more abstract level of its state and link
hierarchy: $I = \frac{1}{2} L_b$. Taken together, these expressions
can be rewritten as $L_f = \frac{1}{2} L_a + \frac{1}{2} L_b$, giving a more
general law relating the three quantities.

3  Experimental  Studies  of  Numeric 
Discovery 

In this section, we evaluate the robustness of IDS’ numeric
discovery component using artificial domains. We first examine
the effect of varying certain system parameters and then
investigate how different levels of noise in numeric data
influence the system’s predictive accuracy. Finally, we study
the effects of irrelevant attributes and law complexity on the
learning rate.

3.1  Influence  of  the  Correlation  Threshold 

IDS’ numeric discovery component incorporates four system
parameters. The beam width determines the memory size used in
the search process, whereas the maximum degree of terms and
the maximum depth of the search tree limit the amount of
search. Experience has shown that these parameters do not have
a substantial influence on the system, provided they are
sufficiently large; for the experiments reported below, we
used five as the value for the beam width and the maximum
degree, and 12 as the value for the maximum depth. However,
the fourth parameter, the correlation threshold, does impact
the behavior of the system. IDS uses this threshold value as a
termination criterion to end its search for numeric terms. For
example, consider the law $x = a^2$. If the data contain no
noise, then the correlation between $x$ and $a^2$ is one.
However, if there is noise in either x or a, then the
correlation between $x$ and $a^2$ will never be perfect.
Hence, if the numeric data are likely to contain noise, then
the correlation threshold must be set to a lower value.

^This search process would be expensive in complex domains,
but the worst-case cost should increase only as the square of
the length of the histories.

In order to determine an optimal value for the correlation
threshold in the presence of noisy data, we performed an
experiment in which we varied this parameter. We randomly
generated values for six numeric attributes (a, b, c, d, e,
and x) obeying the relation x = 2.0 × [illegible term] + 30.0.
The range of variables a through e was 10.0 to 100.0, and
values of x ranged from 30.2 to 20030.0. Attribute e was not
used in the law, and thus was an irrelevant attribute.
Furthermore, we introduced noise into the values of x, using
constant inaccuracy with $\sigma = 100.0$, as described in
Section 3.2. We gave IDS these data with the goal of finding a
law that related x to the other numeric variables.

Because IDS is incremental, it generates a hypothesis for the
law as it incorporates each instance. We then used this
hypothesis to measure the accuracy of prediction using an
independent noise-free test set. We chose noise-free rather
than noisy test data to measure the accuracy of prediction
because the former allows a standard control to compare the
different experimental conditions. In this experiment, we
measured the absolute difference between the predicted and
actual values of x after every three instances (for efficiency
reasons) and recorded the average difference over 30 test
instances. IDS did not incorporate the instances of the test
set, but only used them to measure the accuracy of the current
hypothesis.
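
The evaluation protocol can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the source: the term `a * b / c` is a hypothetical stand-in for the law's term, the correct term is assumed to be known already (so only the slope and intercept are re-estimated as instances arrive), and the helper names are invented for the example.

```python
import random


def fit_line(ts, xs):
    """Ordinary least squares estimate of slope and intercept for x = slope*t + intercept."""
    n = len(ts)
    mean_t, mean_x = sum(ts) / n, sum(xs) / n
    stt = sum((t - mean_t) ** 2 for t in ts)
    slope = sum((t - mean_t) * (x - mean_x) for t, x in zip(ts, xs)) / stt
    return slope, mean_x - slope * mean_t


def mean_abs_error(slope, intercept, test_ts, test_xs):
    """Average absolute difference between predicted and actual x on the test set."""
    return sum(abs(slope * t + intercept - x)
               for t, x in zip(test_ts, test_xs)) / len(test_ts)


rng = random.Random(1)


def sample_term():
    # Hypothetical stand-in term; attribute values lie between 10.0 and 100.0.
    a, b, c = (rng.uniform(10.0, 100.0) for _ in range(3))
    return a * b / c


# 75 noisy training instances (constant inaccuracy, sigma = 100.0).
train_t = [sample_term() for _ in range(75)]
train_x = [2.0 * t + 30.0 + rng.gauss(0.0, 100.0) for t in train_t]

# 30 noise-free test instances, never used for fitting.
test_t = [sample_term() for _ in range(30)]
test_x = [2.0 * t + 30.0 for t in test_t]

# Re-fit on the instances seen so far and record the test error every three instances.
curve = []
for n in range(3, 76, 3):
    slope, intercept = fit_line(train_t[:n], train_x[:n])
    curve.append((n, mean_abs_error(slope, intercept, test_t, test_x)))
```

Averaging such curves over 30 independently generated data sets would give a learning curve of the kind reported in the figures.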

Figure 3a displays the learning curves for three different
values of the correlation threshold (0.99, 0.98, and 0.97)
averaged over 30 different runs. When the threshold was 0.99,
IDS needed at most 15 instances to find the correct term. It
then continued to adjust the slope and intercept of the linear
relation, which slowly approached the correct values and
provided increasing predictive accuracy. When the correlation
threshold was 0.98, the system found the correct term in all
but one run after at most 30 instances.* Using a correlation
threshold of 0.97, the system found the correct term in 26 out
of 30 runs after it had processed 75 instances. As IDS
observes more instances, it will eventually find the correct
term and predict the value with greater accuracy. Thus, it
appears that the higher correlation thresholds yield better
learning rates. However, we also ran the same data with 0.999
as the threshold value. In many runs, IDS failed to find a
term that has a correlation higher than 0.999 within 75
training instances, for reasons that we will explain in
Section 3.2.

*The one incorrect term is the reason for the difference
(after 30 instances) between the learning curves when the
threshold is 0.99 and when the threshold is 0.98.
Figure 3. (a) The effect of varying IDS’ correlation threshold on predictive ability.
(b) The influence of noise due to constant inaccuracy on predictive ability.


Hence, the value of the correlation threshold impacts the
behavior of the system. If the value is set too high, IDS will
fail to find a term with an acceptable level of correlation.
If the level is too low, the learning rate decreases. However,
we will show in the next section that one correlation
threshold value can be used with data over a wide range of
noise levels. We will use a value of 0.99 as the correlation
threshold for the remainder of this paper.
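
The role of the threshold as an acceptance test can be sketched as follows. The law x = a^2, the noise level, and the function names are assumptions chosen for illustration; the sketch shows only the acceptance test for a single candidate term, not the search that proposes terms.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def accepts(term_values, x_values, threshold=0.99):
    """A candidate term ends the search only if its correlation reaches the threshold."""
    return abs(pearson(term_values, x_values)) >= threshold


rng = random.Random(0)
a = [rng.uniform(10.0, 100.0) for _ in range(75)]
term = [v ** 2 for v in a]                            # candidate term for the assumed law x = a^2
clean_x = list(term)                                  # noise-free observations of x
noisy_x = [t + rng.gauss(0.0, 2000.0) for t in term]  # heavy constant noise on x

clean_ok = accepts(term, clean_x)   # correlation is one without noise
noisy_ok = accepts(term, noisy_x)   # noise pulls the correlation below 0.99
```

With this much noise even the correct term is rejected at a threshold of 0.99, which is the failure mode described for thresholds that are set too high relative to the noise.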

3.2  The  Influence  of  Noise 

Much of the early work on empirical discovery assumed that
data were noise free (e.g., Lenat, 1978; Langley et al.,
1983). This is clearly a simplification, and in this section
we study how different levels of noise influence the behavior
of IDS’ numeric discovery algorithm. All numeric data in
science ultimately come from measurement instruments, and one
primary source of error stems from the inaccuracy of the
instruments or the operator handling the instrument.

We considered two kinds of measurement inaccuracies: constant
and relative. For example, a thermometer that is calibrated to
±0.5° measures all values on its scale to within one degree of
accuracy regardless of the value of the measured quantity.
However, the expected error also can depend on the magnitude
of the measured quantity. For example, the accuracy of a
voltmeter might be ±0.5%. That is, if the voltage measures 10
volts, we can expect this measurement to be accurate within
±0.5% of 10 volts, that is, within a band 0.1 volts wide.
However, if the voltmeter displays a value of 100 volts, then
this measurement is only accurate within a band 1.0 volts
wide. In domains involving the first type of inaccuracy
(constant), it seems appropriate to evaluate predictive
ability in terms of absolute error; in domains involving the
second type of inaccuracy (relative), one should instead
measure the percentage error.

To study the effect of noise due to constant inaccuracy, we
gave IDS the same data as in the previous experiment: randomly
generated values of six numeric attributes (a, b, c, d, e, x)
that obeyed the relation x = 2.0 × [illegible term] + 30.0.
We then corrupted each value of x by replacing its value with
a sample taken from a normal distribution with a mean of x and
a standard deviation of $\sigma$. In each experimental
condition we used a different value of $\sigma$. We did not
corrupt the values of the other attributes even though it is
unrealistic to assume real-world noise occurs only in one
attribute. However, this allows for a far better comparison of
different levels of noise in the data, because the amount of
noise is independent of the function. As in the previous
experiment, we used an independent test set of 30 noise-free
instances to measure the average (absolute) accuracy of
prediction over 30 different runs. The value of the
correlation threshold for this and all other experiments was
0.99.
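
The two noise models and their matching error measures can be written down compactly. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the relative model assumes that the standard deviation is a fixed percentage of the true value, which is one plausible reading of relative inaccuracy.

```python
import random
import statistics


def corrupt_constant(x, sigma, rng):
    """Constant inaccuracy: normal noise with mean x and a fixed standard deviation."""
    return rng.gauss(x, sigma)


def corrupt_relative(x, percent, rng):
    """Relative inaccuracy (assumed form): standard deviation is a percentage of x."""
    return rng.gauss(x, abs(x) * percent / 100.0)


def absolute_error(predicted, actual):
    """Error measure suited to constant inaccuracy."""
    return abs(predicted - actual)


def percentage_error(predicted, actual):
    """Error measure suited to relative inaccuracy."""
    return 100.0 * abs(predicted - actual) / abs(actual)


rng = random.Random(0)

# Under relative noise the spread of the measurements grows with the magnitude.
small = [corrupt_relative(10.0, 1.0, rng) for _ in range(2000)]
large = [corrupt_relative(1000.0, 1.0, rng) for _ in range(2000)]
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)
```

Because the spread under relative noise scales with the measured value, an absolute error measure would be dominated by the largest values, which is why percentage error is the more natural dependent measure in that setting.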

Figure 3b shows the learning curves for IDS as the noise
levels vary. The learning rate does not differ substantially
for different amounts of noise. In most runs, it took the
system fewer than 25 instances to find the correct term
regardless of the noise level. However, the noise level
influences the estimates for the slope and the intercept,
which determine the accuracy after the correct term is found.
We should note that when the noise level was at
$\sigma = 250.0$, IDS often could not find a law using the
training set. Given only 75 data points, even the correlation
between x and the correct numeric term was often less than
0.99 because of the noise in the data. However, if we set the
correlation threshold to a lower value or used more training
instances, then the system found the correct term in all runs.

In dealing with constant inaccuracy, IDS uses a least squares
regression algorithm to estimate the slope and intercept,
which minimizes the sum of the squares of the deviations over
all data points and thus also minimizes the variance. However,
relative inaccuracy causes the expected error to increase as
the magnitude of the attribute becomes larger, so that
deviations are not equally distributed over the range of
values. Thus, if the expected error is proportional to the
magnitude of the value and the values are spread over a wide
range, then the basic least squares method will essentially
ignore the small values in its estimation and produce poor
estimates for the parameters in the linear relation.

In order to avoid this problem, IDS uses a variant of the
weighted least squares method (Box, Hunter, & Hunter, 1978) to
calculate the parameters of a linear relation if the expected
error is proportional to the magnitude of the attribute. In
such cases, this variant produces better estimates for the
slope and intercept than the regular least squares method.
Nordhausen (1989) reports an experiment similar to the one
above, in which IDS uses this second method to discover laws
with different levels of relative inaccuracy. As before, the
learning rate (with percentage error as the dependent
variable) decreased only slightly as the noise level
increased, and in most runs IDS took fewer than 25 instances
to find the correct term. Again the error rate affected the
predictive accuracy once the correct term was found.
Furthermore, the data with a 10% noise level proved too noisy
in some cases, and the system failed to discover any law in
five out of the 30 runs.
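
The contrast between ordinary and weighted least squares can be made concrete with a small sketch. The source does not specify which variant of weighted least squares IDS uses; the weights 1/x^2 below are an assumption that corresponds to an error standard deviation proportional to the measured value, and all function names are hypothetical.

```python
import random


def weighted_line_fit(ts, xs, ws):
    """Fit x = slope*t + intercept by minimizing the sum of ws[i] * residual[i]**2."""
    sw = sum(ws)
    mean_t = sum(w * t for w, t in zip(ws, ts)) / sw
    mean_x = sum(w * x for w, x in zip(ws, xs)) / sw
    stt = sum(w * (t - mean_t) ** 2 for w, t in zip(ws, ts))
    stx = sum(w * (t - mean_t) * (x - mean_x) for w, t, x in zip(ws, ts, xs))
    slope = stx / stt
    return slope, mean_x - slope * mean_t


def ordinary_fit(ts, xs):
    """Regular least squares: every instance receives the same weight."""
    return weighted_line_fit(ts, xs, [1.0] * len(ts))


def relative_error_fit(ts, xs):
    """Assumed variant: weight 1/x^2, so small values are not swamped by large ones."""
    return weighted_line_fit(ts, xs, [1.0 / (x * x) for x in xs])


# Data with relative inaccuracy: noise s.d. is 5% of the true value of x.
rng = random.Random(2)
ts = [rng.uniform(0.1, 10000.0) for _ in range(75)]
xs = [rng.gauss(2.0 * t + 30.0, 0.05 * (2.0 * t + 30.0)) for t in ts]

ols = ordinary_fit(ts, xs)        # (slope, intercept) from regular least squares
wls = relative_error_fit(ts, xs)  # (slope, intercept) from the weighted variant
```

On noise-free data both fits recover the same line; they differ only in how much influence the large, noisier values have on the estimates.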

Whether the noise results from constant or relative inaccuracy
of the measurement instruments, IDS’ numeric discovery
component proved to be robust. Correlation analysis seems to
be a good heuristic to guide the search through IDS’ space of
possible numeric terms, even in the presence of noise. Once
the system finds the correct term, the noise affects the
estimation of the parameters of the linear relation, but as
the system receives more instances, the regression algorithm
produces estimates of increasing accuracy. Moreover, we used
the same correlation threshold value for all conditions in the
two experiments reported in this section. These experiments
demonstrate that, even though the value of the correlation
threshold affects the behavior of IDS, the system is not
overly sensitive to this parameter and that the same
correlation threshold value can be used for data covering a
wide range of noise.


3.3  The  Influence  of  Irrelevant  Attributes 

In the real world, one cannot always determine relevance of an
attribute a priori. In the previous experiments, we always
included one irrelevant attribute in the data. In another
experiment, we investigated how irrelevant attributes affect
IDS’ learning rate. We again gave the system randomly
generated data of numeric attributes obeying the function
x = 2.0 × [illegible term] + 30.0. We corrupted the values of
x using noise due to constant inaccuracy (with
$\sigma = 100.0$) as described in the previous section.
However, in this experiment we varied the number of irrelevant
attributes included in the data from zero to 20.

As before, we gave the system 30 different data sets for each
experimental condition, and measured the absolute difference
between the actual and predicted value. Figure 4a shows that
the number of irrelevant attributes only increases the number
of observations required to find the correct term. Even with
20 irrelevant attributes, IDS found the correct term within at
most 24 instances. As the number of irrelevant attributes
increases, the probability of an accidental correlation
between x and a term involving irrelevant attributes becomes
higher. However, given enough instances the numeric discovery
component finds the correct term, so that irrelevant
attributes affect only the learning rate. Once the system has
identified the correct term, it ignores the irrelevant
attributes, and the predictive accuracy becomes identical in
all cases. Thus, the numeric learning component is robust even
in the presence of many irrelevant variables.
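
The effect of accidental correlations can be illustrated with a small simulation. This sketch is not drawn from the source: it considers only raw irrelevant attributes rather than composite terms, it draws x at random instead of from a law, and the function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def accidental_correlations(n_instances, n_irrelevant, rng):
    """Strongest |correlation| with x seen so far as irrelevant attributes are added."""
    x = [rng.uniform(30.0, 20030.0) for _ in range(n_instances)]
    best, running = 0.0, []
    for _ in range(n_irrelevant):
        e = [rng.uniform(10.0, 100.0) for _ in range(n_instances)]
        best = max(best, abs(pearson(e, x)))
        running.append(best)
    return running


rng = random.Random(4)
few_instances = accidental_correlations(6, 20, rng)    # few observations
many_instances = accidental_correlations(75, 20, rng)  # many observations
```

The running maximum can only grow as attributes are added, and chance correlations of a given strength become rarer as the number of instances increases, which is consistent with irrelevant attributes delaying discovery without preventing it.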

3.4  The  Influence  of  Function  Complexity 

Finally, in order to test the influence of function
complexity, we ran the system on data obeying different laws
in which we varied the number of variables and their degree.
In this experiment, we randomly generated data for numeric
attributes obeying one of three laws, each of the form
x = 2.0 × [illegible term] + 30.0 with a different term. These
functions were chosen arbitrarily but increase in complexity.
As before, we included one irrelevant attribute in each of the
three data sets. The values for attributes a through d ranged
from 10.0 to 100.0. Because of the different terms, the values
of x ranged over a different interval for each function. Thus,
we chose to introduce noise from relative inaccuracy (with a
value of $\sigma = 0.75\%$) rather than constant inaccuracy,
so the level of noise was comparable for each data set. Hence,
we used percentage error rather than the absolute difference
as the dependent measure in this experiment.

Figure 4b shows the learning curves for the three functions
averaged over 30 random data sets, each containing 75
instances.* For the simplest function, IDS found the correct
term rather quickly, requiring at most nine instances. There
are two reasons the system needs more instances to discover
the term as the complexity of the function increases. As we
showed in the previous experiment, when the number of observed
attributes increases, then the probability for accidental
correlations between x and an incorrect term also increases.
Furthermore, as the function becomes more complex, the path
for the correct term through the space of possible numeric
terms increases in length. However, once IDS finds the correct
term, the predictive accuracy is similar regardless of the
function’s complexity. This suggests that function complexity
only affects the number of instances required to find the
correct term, but not the predictive accuracy once this term
is found.

*The reason for the aberrations in the learning curves can be
attributed to the runs in which the system replaces an
incorrect term with an improved term that describes the
processed training data better, but which actually does worse
than the replaced term on the test data.

Figure 4. (a) The influence of irrelevant attributes on predictive ability.
(b) The influence of the function complexity on predictive ability.

4  Discussion 

In closing, we should attempt to draw some conclusions from
our experiences with IDS. Below we discuss some strengths and
limitations of the numeric algorithm, followed by the advances
we believe our research has made over earlier approaches to
discovery.

4.1  Strengths  and  Limitations 

In the previous section, we reported a number of experiments
with IDS’ numeric discovery component. The studies produced
encouraging results on the use of correlation analysis as a
heuristic to guide the search for constant numeric terms. We
saw that IDS is not overly sensitive to the setting of the
correlation threshold, and that it can use a single value to
discover relations of data with varying degrees of noise. In
addition, the system tolerates noise that results from both
constant and relative inaccuracy of measurement instruments.
One must adjust the regression algorithm for the two kinds of
noise, but it seems reasonable to assume one knows the kind of
error a measurement instrument produces. Furthermore, we ran
experiments that showed IDS tolerates irrelevant variables
well. In these studies, the learning rate was hardly affected
even when a large percentage of the observed attributes were
irrelevant. Finally, we showed that the learning rate
decreases only slowly as the complexity of the function
describing the data increases.

However, one should treat these results with caution, because
there are some well-known problems with correlation and
regression analysis (Box et al., 1978). For example, if a
numeric attribute has only a limited range, then correlation
analysis will fail to detect a law including that attribute.
Furthermore, although we examined the effects of noise,
irrelevant terms, and complexity in isolation, we did not
consider the possibility of interactions among these factors.
It seems plausible that IDS would encounter difficulty in
noisy domains with complex functions and many irrelevant
terms. Moreover, since the numeric discovery process is
embedded within IDS’ mechanisms for taxonomy formation and
qualitative discovery, it relies on the successful operation
of these mechanisms to obtain useful results. The experiments
described above tested the numeric component in isolation,
rather than in the context of the complete system. Finally, we
have yet to determine the algorithm’s behavior on real-world
data. Future studies should address all of these issues.
4.2 Advances Over Previous Work

IDS finds numeric laws similar in form to those produced by
BACON (Langley et al., 1983), ABACUS (Falkenhainer &
Michalski, 1986), and FAHRENHEIT (Langley & Zytkow, 1989).
Moreover, these systems employ similar methods to control
their search for useful numeric terms, using correlation-like
techniques to focus attention. However, they differ in the
details of their search control. BACON and FAHRENHEIT employ a
heuristic form of depth-first search, focusing on more
recently defined terms. ABACUS creates a proportionality graph
to determine promising combinations of terms, then uses a
modified beam search to find the best combinations. IDS
carries out a similar but less sophisticated beam search,
relying on correlation analysis to guide its steps through the
space of numeric terms. With respect to robustness, IDS
provides a clear advance over BACON and FAHRENHEIT, which used
simple ‘trend detectors’ rather than correlation, giving them
only limited abilities for handling noise and irrelevant
terms. The relation to ABACUS on this front is less clear, but
future studies should compare the behavior of these
algorithms.

Another difference between IDS and previous numeric discovery
systems is its incremental approach and its reuse of existing
terms. The current implementation requires reuse of all
previous instances stored at a given node or link, but a
version that computed the correlation scores incrementally
would be considerably more efficient than a nonincremental
version for domains in which observations are made a few at a
time. Moreover, the reuse of existing terms should have a
significant impact in domains involving many variables.
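
One way in which correlation scores could be maintained incrementally is sketched below. This is an assumption about how such a version might work, not a description of IDS: it uses a standard running-sums update (in the style of Welford's method), and the class name `RunningCorrelation` is hypothetical.

```python
import math


class RunningCorrelation:
    """Maintains the Pearson correlation of (x, y) pairs without storing the pairs."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.sxx = 0.0  # running sum of squared deviations of x
        self.syy = 0.0  # running sum of squared deviations of y
        self.sxy = 0.0  # running sum of cross deviations

    def add(self, x, y):
        """Incorporate one new observation in constant time."""
        self.n += 1
        dx = x - self.mean_x          # deviation from the old mean of x
        dy = y - self.mean_y          # deviation from the old mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Each update multiplies an old-mean deviation by a new-mean deviation.
        self.sxx += dx * (x - self.mean_x)
        self.syy += dy * (y - self.mean_y)
        self.sxy += dx * (y - self.mean_y)

    def value(self):
        """Current correlation, or None while it is still undefined."""
        if self.n < 2 or self.sxx == 0.0 or self.syy == 0.0:
            return None
        return self.sxy / math.sqrt(self.sxx * self.syy)


rising = RunningCorrelation()
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]:
    rising.add(x, y)

falling = RunningCorrelation()
for x, y in [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]:
    falling.add(x, y)
```

Each new instance costs a constant amount of work per candidate term, so the stored instances at a node or link would not need to be revisited when a few observations arrive at a time.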

However, the most significant difference between IDS and
earlier systems concerns the statement of conditions on the
discovered laws. BACON, FAHRENHEIT, and ABACUS all include
conditional statements on their numeric relations, but these
statements contain little information about the structure or
physical situation in which the laws occur. Even as simple a
relation as the ideal gas law actually involves a set of
structurally-related objects that change over time, which IDS
can represent using a single abstract qualitative state. More
complex laws involve sequences of qualitative states that
together specify qualitative laws, as in the laws of fluid
flow and in the generalization that acids react with alkalis
to produce salts. In each case, the system annotates these
qualitative summaries with numeric relations, providing detail
beyond that usually included in qualitative representations.

Simultaneously, IDS’ taxonomy of qualitative states and its
abstract successor links specify a context for the
quantitative laws it discovers. These constrain both the
situations in which the numeric laws are applied and the
search for relations among numeric quantities, since a law
makes sense only when embedded within some qualitative
description. Moreover, as Nordhausen (1989) describes in
detail, the explicit representation of state transitions and
time support the discovery of terms and laws that other
systems cannot handle. Thus, IDS defines the boiling point of
a substance as an intrinsic property associated with a law
describing the transition between states. Also, by treating
the duration of a qualitative state as an explicit quantity,
the system can discover the law governing radioactive decay,
in which the ‘rate of reaction’ depends on the amount of
material. In summary, IDS’ augmented representation allows
some significant advances over previous approaches to numeric
discovery.
Abstract

The refinement of knowledge bases is an important activity in
the expert system lifecycle. Belief networks have been
proposed and studied as an alternative to rules in expert
system knowledge bases. The problems of synthesis and
refinement of belief networks, arising from the development of
a large belief-network-based knowledge-based system, are
presented and formally analyzed. We prove that, for very
simple Dempster-Shafer networks (trees), defining parameter
values (synthesis) is NP-Complete. Additional results that are
given without proof include the computational intractability
of refining expert-estimated values (refinement), even when we
settle for approximate values or demand agreement on only a
certain percentage of cases. The potential impact of these
results on the practice of expert system construction is
discussed.

1.  Introduction 

Knowledge  base  refinement  is  concerned  with 
modifications  to  a  knowledge  base  that  increase  its 
breadth  and  accuracy. 

Most published work in knowledge base refinement has addressed
truth-functional rule bases. From the point of view of
knowledge-base refinement, truth-functional (or extensional)
and non-truth-functional (or intensional) systems are
distinctly different [Ruspini, 1982; Pearl, 1988]. A
rule-based expert system is truth-functional if the belief
associated with a proposition in the system depends only on
the belief in propositions that appear in the premise of rules
that conclude the original proposition, with an obvious
exception for the propositions that are not concluded by any
rule. In his contribution to the knowledge-base refinement
track of last year’s International Workshop on Machine
Learning, Valtorta [1989c] addressed the computational
complexity of truth-functional rule base refinement.

We now report new results that extend the complexity analysis
of knowledge base refinement to systems that are not
truth-functional, viz. belief networks.

Belief networks are gaining popularity as a formalism for
implementing knowledge bases for expert systems. With respect
to the more common MYCIN-style rule bases, belief networks
overcome the problems arising from a truth-functional approach
to evidence propagation, by adopting a model-based (or
intensional) approach, as explained, e.g., in [Pearl, 1988,
chapter 1]. The use of a sounder approach to handling
uncertainty has its drawback in the worst-case inefficiency of
belief computation in belief networks [Cooper, 1988; Provan,
1989].

A key feature of belief networks is their use of numerical
parameters. These parameters are probability masses in
Dempster-Shafer networks (see [Smets, 1988] for a brief
introduction) and conditional probabilities (or related
parameters, such as likelihood ratios) in Bayesian networks
(see [Pearl, 1988] for the canonical treatment). These
numerical parameters can be estimated using statistical
techniques. However, this may be infeasible in practice, and
developers must then resort to knowledge engineering
techniques, as documented in the construction of MUNIN.

There are only very few large applications of belief networks.
Of these, probably the best known is MUNIN [Andreassen et al.,
1987]. As of mid-June 1989, MUNIN had grown to a network of
approximately 1000 nodes.* "Estimating the 270 conditional
probabilities [in the MUNIN belief network of 1987] (...)
would require at least 10000 cases. Instead of relying on this
empirical approach, we have tried to rely as much as possible
on 'deep knowledge', using an understanding of
pathophysiological processes as expressed in medical textbooks
and papers" [Andreassen et al., 1987, p. 369]. In any case,
whatever the source of the numerical parameters, they need to
be refined. In MUNIN, "Discrepancies between the network and
the medical experts lead to revision of the model parameters
(...). Occasionally, it may even be necessary to modify the
structure of the network, adding or deleting states or nodes"
[Andreassen et al., 1987, p. 370]. Discrepancies between the
network and the expert are uncovered by comparing the belief
assigned to particular variables in particular nodes by the
expert and by the network, when the same evidence is presented
to the expert and to the network. In MUNIN, the evidence
typically describes the findings for a patient, while the
particular variables correspond to possible diagnoses.

*Steen Andreassen, personal communication. MUNIN has been
developed in the context of ESPRIT project 599.

This paper addresses the problem of synthesis and refinement
of numerical parameters in Dempster-Shafer belief networks
from an algorithmic standpoint. The author has obtained
analogous results for Bayesian networks [Valtorta and
Loveland, 1989]. Section 2 formalizes the problem already
described in this introduction, by defining, among others, the
notion of case. In section 3 we show that the synthesis of
masses in Dempster-Shafer networks from cases is NP-Complete.
Additional results, including the proof that the refinement of
masses in Dempster-Shafer networks from cases is NP-Complete,
and some settings involving approximations, are presented
(without proof) in section 4. Section 5 discusses related
work. Section 6 concludes the paper with an assessment of
results.

The proofs that are not in the paper and much additional
material can be found in [Valtorta and Loveland, 1989], which
has been submitted for journal publication.

2.  Formalizing  Mass  Refinement 

Without loss of generality, since we are after lower bound
results, we consider Dempster-Shafer networks (from now on,
DS-nets) in the form of a tree (and call them, simply,
DS-trees). There are a considerable number of different
versions of DS-trees in the literature, all related in
rather straightforward ways. In this paper, alternating
Markov trees are used. Our definition is adapted from
[Mellouli, 1988, pp. 66, 85]. A (qualitative) Markov tree of
variables in a set S is a tree T = (N, E), such that N is a
subset of the power set of S (i.e., the nodes of the tree
are subsets of S) and such that the intersection of two
nodes n_1 and n_2 is contained in node n_3 if n_3 lies
between n_1 and n_2 in some branch of T. A Markov tree
T = (N, E) is alternating if every node in the tree is
either contained in all its neighbors or contains all its
neighbors. (For example, the tree in Figure 1 is an
alternating Markov tree for variables in the set
S = {c_1, c_2, ..., c_n, d}.)
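
The two conditions in this definition can be checked mechanically. The following Python sketch is illustrative only: the representation of nodes as frozensets, the function names, and the star-shaped tree built by figure1_like_tree (an assumed shape for the tree of Figure 1, whose diagram is not available here) are assumptions, not part of the original definition.

```python
from itertools import combinations


def path_between(adj, start, goal):
    """Return the unique path from start to goal in a tree (adjacency dict)."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adj[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None


def is_markov_tree(adj):
    """Markov condition: n1 & n2 is contained in every node between n1 and n2."""
    for n1, n2 in combinations(adj, 2):
        common = n1 & n2
        path = path_between(adj, n1, n2)
        if any(not common <= middle for middle in path[1:-1]):
            return False
    return True


def is_alternating(adj):
    """Every node is contained in all its neighbors or contains all of them."""
    return all(
        all(node <= nb for nb in nbrs) or all(node >= nb for nb in nbrs)
        for node, nbrs in adj.items()
    )


def figure1_like_tree(n):
    """Assumed shape: root {d}, joint nodes {c_i, d}, leaves {c_i}."""
    root = frozenset({"d"})
    adj = {root: []}
    for i in range(1, n + 1):
        leaf = frozenset({f"c{i}"})
        joint = frozenset({f"c{i}", "d"})
        adj[root].append(joint)
        adj[joint] = [root, leaf]
        adj[leaf] = [joint]
    return adj


tree = figure1_like_tree(3)
print(is_markov_tree(tree), is_alternating(tree))
```

Under the assumed shape, every leaf and the root are contained in their neighbors, while every joint node contains its neighbors, which is the alternation the definition asks for.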


For simplicity (and again, without loss of generality),
each of the variables in the DS-tree will be assumed to be
two-valued. Call the two values 0 and 1. We will express
the mass assigned to a subset of the frame of
discernment^1 of variable a as m(a=v_j), where v_j = 0, 1,
or as m(l_j), where l_j = a, ~a. The mass assigned to the
frame of discernment of variable a will be indicated as
m(Θ(a)) or, in case there is no ambiguity, simply as m(Θ).
For joint variables, we will indicate a subset of the frame
of discernment by a propositional formula. For example,
consider the joint variable (a,b) whose frame of
discernment is Θ((a,b)) = {{a=0,b=0}, {a=0,b=1}, {a=1,b=0},
{a=1,b=1}}. The mass assigned to the subset {{a=0,b=0},
{a=0,b=1}, {a=1,b=1}} will be indicated as m(~a v b), or as
m(a => b). The mass assigned to the frame of discernment of
the joint variable (a,b) will be indicated as m(Θ((a,b)))
or, in case there is no ambiguity, simply as m(Θ).
Following Pearl [1988, p. 418], we call a subset of the
frame of discernment such as (~a v b) a compatibility
relation. (Intuitively, m(~a v b) quantifies the constraint
that a is not compatible with ~b.)
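
As an illustration of this notation, the following Python sketch enumerates the frame of discernment of the joint variable (a,b) and the subset picked out by the formula ~a v b. The function names and the use of dictionaries for frame elements are assumptions made for the example.

```python
from itertools import product


def frame_of_discernment(variables):
    """All joint assignments of the values 0 and 1 to the given variables."""
    return [
        dict(zip(variables, values))
        for values in product((0, 1), repeat=len(variables))
    ]


def subset_for(variables, formula):
    """Elements of the frame that satisfy a propositional formula.

    The formula is passed as a Python predicate over the variable names.
    """
    return [w for w in frame_of_discernment(variables) if formula(**w)]


# The compatibility relation (~a v b), i.e. a => b.
implies = subset_for(("a", "b"), lambda a, b: (not a) or b)
for element in implies:
    print(element)
```

The only element of the frame left out is {a=1,b=0}, the one assignment in which a holds together with ~b.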

A DS-presentation is defined as a triple consisting of a
DS-tree, a set of compatibility relations, and an
assignment of masses to some of the compatibility relations
and their frames of discernment. Figure 1 shows a
DS-presentation when values s_1, s_2, ..., s_n are fixed.
DS-presentations will be considered as realizing a function
from the vector of masses assigned to the leaf nodes of a
DS-tree to a belief (simply a mass for the nets that we
consider in this paper) for the root node (via a process
involving Dempster's rule and summarized later). A point in
the graph of the function will be called a case.^2 We now
describe how this models the situation described by
Andreassen et al. [1987] and summarized in the
introduction. The output part of a case describes the
desired "answer" of the DS-tree when "queried" with the
evidence encoded as the input part of the same case.
Typically, there will be a discrepancy (output error)
between the value of the belief as computed by the tree and
the belief given as output part of the case. This
discrepancy must be eliminated in order for the DS-tree to
work correctly.

Before addressing the task of refinement, we address in the
next section the more basic task of parameter instantiation
(assigning values to the parameters). We call this the
synthesis task.

^1 See, e.g., [Smets, 1988] or [Gordon and Shortliffe, 1985]
for the definition. Informally, the frame of discernment for
a set S of variables (where all variables have a discrete
range of values) is the Cartesian product of the set of
variable-value pairs for all variables in S.

^2 T_1 through T_4 of Figure A (in the Appendix) are examples
of cases. Each case is an input-output pair.

3.  Initial  Mass  Synthesis  Is  NP-Hard 

The  problem  considered  in  this  section  is  a  prob¬ 
lem  of  synthesis,  rather  than  a  problem  of  refinement 
of  masses  in  DS-trees. 

Problem  name:  Mass  Synthesis  (MS). 

Problem instance: A DS-tree and associated compatibility
relations as given in Figure 1; a set of cases.

Question: Is there an assignment of values to s_1,...,s_n
such that the function realized by the DS-tree satisfies
the cases?

Theorem 1

MS is NP-Complete.

The proof is given in the Appendix.

4.  Mass  Refinement  Is  NP-Hard 

Problem  name:  Mass  Refinement,  Search  Version 
(MRS). 


Problem instance: A DS-tree and associated compatibility
relations as given in Figure 1; an assignment of values^3
for s_1,...,s_n; a positive constant e; a set of cases.

Question: Find an assignment of values to s_1,...,s_n each
of which is at most e away from the given assignment and
such that the function realized by the DS-tree satisfies
the cases.

MRS  is  NP-Hard  if  the  following  decision  prob¬ 
lem  is: 

Problem  name:  Mass  Refinement  (MR). 

Problem instance: A DS-tree and associated compatibility
relations as given in Figure 1; an assignment of values for
s_1,...,s_n; a positive constant e; a set of cases.

Question: Is there an assignment of values to s_1,...,s_n
each of which is at most e away from the given assignment
and such that the function realized by the DS-tree
satisfies the cases?

^3 These values may be expert-given or otherwise estimated.

[Tree diagram not reproduced. Masses shown in the figure:]
m_i(c_i => d) = s_i, m_i(Θ) = 1 - s_i, for i = 1,...,n

Figure 1 DS-tree and associated compatibility relations for problem MS.

Theorem  2 

MR is NP-Complete for any fixed e.

We  now  present  three  problems  that  arise  from 
attempts  to  define  meaningful  approximate  solutions  to 
the  refinement  of  numerical  parameters. 

Problem  name:  Approximate  Mass  Synthesis 
(AMS). 

Problem  instance:  As  for  problem  MS  in  section 
3. 

Question: Find an assignment to s_1,...,s_n strictly within
0.5 of the correct assignment, where the correct assignment
is the assignment such that the function realized by the
DS-tree satisfies the cases.

Theorem  3 

AMS  is  NP-Hard. 

Problem  name:  Mass  Synthesis,  Bounded  output 
Error  (MSBE). 

Problem  instance:  A  DS-tree  and  associated  com¬ 
patibility  relations,  as  given  in  Figure  1;  a  set  of  cases; 
a  constant  e. 

Question: Is there an assignment of values to s_1,...,s_n
such that the maximum output error on the cases is less
than or equal to e?

Theorem  4 

MSBE  is  NP-Hard  for  all  e<=d<.5. 

Remark 

To prove Theorem 4, we need some technical conditions on
the possible values of s_1,...,s_n. For a different kind of
refinement involving bounded output error, see the
discussion at the beginning of the next section.

Problem  name:  Noisy  Mass  Refinement  (NMR). 

Problem instance: A DS-tree and associated compatibility
relations as given in Figure 1; an assignment of values for
s_1,...,s_n; a positive constant e; a positive constant k
less than 100; a set of cases.

Question: Is there an assignment of values to s_1,...,s_n
each of which is at most e away from the given ones and
such that the function satisfies k% of the cases?

Theorem  5 

NMR  is  NP-Hard. 

5.  Related  Work 

In applications (such as diagnosis) in which the task (as
defined, e.g., in [Breuker et al., 1987]) of the expert
system using the belief network is to classify, the user is
concerned with the relative ranking (rather than the exact
values) of beliefs associated to the terminal nodes^4 of a
DS-net or a Ba-net. Cooper [1988, sections 5.4 and 5.5] and
Valtorta [1989a, 1989c] provide additional motivation and
techniques. The corresponding refinement problems are
NP-Hard.

The introduction points to some of the work on rule bases,
a common form of knowledge base. Synthesis of numerical
parameters in rule bases is somewhat analogous to training
in neural networks. Indeed there are apparently similar
results of intractability. (Cf. [Judd, 1987, 1988; Blum and
Rivest, 1988; Lin and Vitter, 1989].) However, the
functions used in neural networks to process weights are
different from those used in rule bases or belief networks.
(Cf. [Fu, 1988; Valtorta, 1989a].) As done in much of the
neural network training work, we assume that the structure
of the net is fixed. Typically, it is given by the expert.
The fixed network assumption is therefore more appropriate
in belief network refinement than in neural network
training. We conjecture that refinement of belief networks
remains NP-Hard when limited changes to the structure of
the network are permitted.

6.  Conclusion 

The networks used in the problem instance of MS and MR are
trees. It is well known (e.g., [Kong, 1986; Pearl, 1988;
Lauritzen and Spiegelhalter, 1988; Shenoy and Shafer,
1988]) that the computation of beliefs in trees^5 is
tractable. Therefore, a developer could be faced with the
unpleasant situation in which the belief network is nicely
structured for efficient computation of beliefs, but
refinement is extremely difficult.

Synthesis of numerical parameters in knowledge bases is a
kind of refinement of knowledge bases: the numerical
estimates of the parameters are not available, while the
structure of the knowledge base is. The structure of the
knowledge base can be determined by answering relatively
simple questions about independence of events. Therefore,
the knowledge engineer should believe more strongly in the
(qualitative) network structure than in the values of the
numerical parameters, and it is natural to use an expert to
obtain the knowledge structure and initial guesses. These
considerations suggest a methodology for the construction
of knowledge-based systems that use belief networks. At the
heart of this methodology is a propose and fit cycle. In
this cycle, after interviewing an expert, the structure of
a belief network is proposed. The network parameters are
set or adjusted to fit selected test cases. If parameters
can be set to fit the cases, the development is complete.
If they cannot, the designer needs to consult the expert
further until a new (qualitative) network structure is
proposed. Another attempt is made to set the parameters to
fit the cases, and so on. Our results indicate that it is
hard to automate the "fit" step of this methodology.

^4 When suitably defined, in the obvious way.

^5 The same is true for some kinds of graphs that are not
trees.

As an example of alternative methods for validation and
refinement, consider the use of a type of oracle: if an
expert is available after construction of the knowledge
base, the expert could be used as an oracle to facilitate
knowledge base refinement. In this mode, the expert would
be queried with specially focused questions allowing the
synthesis or refinement of specific masses or likelihoods.
A result by Valtorta concerning rule bases [1987, chapter 4
and chapter 7; 1989b] indicates that automatic synthesis or
refinement in certain belief networks that are trees is
doable in polynomial time and suggests that it is
intractable for graphs. More remains to be done along this
line of work.


Acknowledgements 

We  gratefully  acknowledge  many  useful  discus¬ 
sions  with  Donald  Loveland  of  Duke  University.  The 
referees  suggested  several  improvements. 


References 

Andreassen, S., M. Woldbye, B. Falck, S.K. Andersen.
"MUNIN - A Causal Probabilistic Network for Interpretation
of Electromyographic Findings." Proceedings of the Tenth
International Joint Conference on Artificial Intelligence,
366-372, 1987.

Blum, A. and R.L. Rivest. "Training a 3-Node Neural Network
is NP-Complete." Proceedings of the 1988 Workshop on
Computational Learning Theory (COLT-88), 9-18.

Breuker, J., B. Wielinga, M. van Someren, R. de Hoog, G.
Schreiber, P. de Greef, B. Bredeweg, J. Wielemaker, J.-P.
Billault, M. Davoodi, S. Hayward. "Model-Driven Knowledge
Acquisition: Interpretation Models." Deliverable task A1,
ESPRIT Project 1098, and Memo 87, VF Project Knowledge
Acquisition in Formal Domains, 1987. (This report is
available from the Department of Social Science
Informatics, University of Amsterdam.)

Cooper, G.F. "Probabilistic Inference Using Belief Networks
is NP-Hard." Stanford University Knowledge Systems
Laboratory Memo KSL-87-27, May 1987 (revised July 1988).


Fu,  L.  "Truth  Maintenance  Under  Uncertainty." 
Proceedings  of  the  Fourth  Workshop  on  Uncertainty  in 
Artificial  Intelligence,  119-126, 1988. 

Garey, M.R. and D.S. Johnson. Computers and Intractability:
A Guide to the Theory of NP-Completeness. New York:
Freeman, 1979.

Gordon,  J.  and  E.H.  Shortliffe.  "A  Method  for 
Managing  Evidential  Reasoning  in  a  Hierarchical 
Hypothesis  Space."  Artificial  Intelligence,  26  (1985), 
323-357. 

Judd,  S.  "Complexity  of  Connectionist  Learning 
with  Various  Node  Functions."  Technical  Report  87-60, 
University  of  Massachusetts  at  Amherst,  July  1987. 

Judd,  S.  "Learning  in  Neural  Networks."  Proceed¬ 
ings  of  the  1988  Workshop  on  Computational  Learning 
Theory  (COLT-88),  2-8. 

Kong, C.T.A. "Multivariate Belief Functions and Graphical
Models." Ph.D. Dissertation, Department of Statistics,
Harvard University, 1986. (Available as Research Report
S-107, Department of Statistics, Harvard University.)

Lauritzen, S.L. and D.J. Spiegelhalter. "Local Computations
with Probabilities on Graphical Structures and their
Applications to Expert Systems." Journal of the Royal
Statistical Society, Series B (Methodological), 50 (1988),
2, 157-224.

Lin, J.-H. and J.S. Vitter. "Complexity Issues in Learning
by Neural Nets." Proceedings of the Second Annual Workshop
on Computational Learning Theory (COLT-89), 118-133, 1989.

Mellouli, K. "On the Propagation of Beliefs in Networks
Using the Dempster-Shafer Theory of Evidence." Ph.D.
Dissertation and Working Paper No. 196, School of Business,
University of Kansas, April 1988.

Pearl,  J.  Probabilistic  Reasoning  in  Intelligent  Sys¬ 
tems:  Networks  of  Plausible  Inference.  San  Mateo, 
California:  Morgan-Kaufmann,  1988. 

Provan,  G.M.  "A  Logic-Based  Analysis  of 
Dempster-Shafer  Theory."  Report,  Department  of  Com¬ 
puter  Science,  University  of  British  Columbia,  October, 
1989. 

Ruspini, E.H. "Possibility Theory Approaches for Advanced
Information Systems." Computer, 15, 9 (September 1982),
83-91.

Shafer, G., P.P. Shenoy, and K. Mellouli. "Propagating
Belief Functions in Qualitative Markov Trees."
International Journal of Approximate Reasoning, 1987.

Shenoy,  P.P.  and  G.  Shafer.  "Propagating  Belief 
Functions  with  Local  Computations."  IEEE  Expert,  1,  3 
(Fall  1986),  43-52. 

Shenoy,  P.P.  and  G.  Shafer.  "An  Axiomatic  Frame¬ 
work  for  Bayesian  and  Belief-function  Propagation" 
Proceedings  of  the  Fourth  Workshop  on  Uncertainty  in 
Artificial  Intelligence,  307-314, 1988. 

Smets, P. "Belief Functions." Chapter 9 in: Smets, P., E.H.
Mamdani, D. Dubois, and H. Prade. Non-Standard Logics for
Automated Reasoning. London: Academic Press, 1988.

Valtorta, M. "Automating Rule Strengths in Expert Systems."
Ph.D. Dissertation, Department of Computer Science, Duke
University, April 1987. (Also: Technical Report
CS-1987-15, Department of Computer Science, Duke
University, and available as number ADG87-25869 from
University Microfilm International.)

Valtorta, M. (a) "Some Results on the Complexity of
Knowledge Base Refinement." Technical Report TR89004,
Department of Computer Science, University of South
Carolina, April 1989. (Revised version accepted for
publication in the International Journal of Approximate
Reasoning.)

Valtorta, M. (b) "Some Results on Knowledge Base Refinement
with an Oracle." Technical Report TR89005, Department of
Computer Science, University of South Carolina, April 1989.

Valtorta, M. (c) "Some Results on the Complexity of
Knowledge-base Refinement." Proceedings of the Sixth
International Workshop on Machine Learning, 323-331.

Valtorta, M. and D.W. Loveland. "On the Complexity of
Belief Network Synthesis and Refinement." Technical Report
TR89011, Department of Computer Science, University of
South Carolina, November 1989 (submitted for journal
publication).


Appendix:  proof  of  Theorem  1 

One in Three Satisfiability (OTS) [Garey and Johnson, 1979,
p.259] will be transformed into MS. The variant in which no
clause in the formula contains a negative literal will be
used. The generic OTS instance is a formula in
3-conjunctive normal form, with no negated variables. The
question is whether there is a model for the expression
such that each clause has exactly one true variable.

Given a formula E in monotone 3-conjunctive normal form,
the following algorithm produces in time polynomial in the
size of E an instance of MS such that the Question has
answer yes if and only if E has a model in which only one
variable per clause is true.

Let n be the number of distinct propositional variables in
E, m be the number of clauses in E. (n and m can be
obtained in polynomial time from any reasonable encoding of
E.) (Name the variables x_1,...,x_n for convenience.)

The number of leaves in the DS-tree of the corresponding
MS-instance is n. The number of cases in the corresponding
MS-instance is 2m.

There are 2 cases for each clause in E. The cases are
defined as follows. Let a and b be a pair of numbers such
that 0<a<b<1. Let a generic clause contain the variables
x_i, x_j, x_k. The input part of the first case for each
clause has m(c_i) = m(c_j) = m(c_k) = a and 0 everywhere
else. The output part of the first case for each clause is
a. To obtain the second case for this clause, substitute b
for a.

The  reader  can  easily  verify  that  the  algorithm 
just  given  runs  in  time  polynomial  in  the  size  of  E. 

As an example, Figure A shows the instance of MS
corresponding to E = (x_1 v x_2 v x_3) & (x_1 v x_2 v x_4).
In the Figure, T_1 and T_2 correspond to the first clause in
E, while T_3 and T_4 correspond to the second clause in E.
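
The construction of the cases can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name, the encoding of a clause as a triple of 1-based variable indices, and the particular values a = 0.25 and b = 0.75 are choices made for the example (the proof only requires 0 < a < b < 1).

```python
def ots_to_ms_cases(clauses, n, a=0.25, b=0.75):
    """Build the 2m cases of the MS instance for a monotone 3-CNF formula.

    clauses: list of 3-tuples of 1-based variable indices (assumed encoding).
    n: number of distinct variables, i.e. number of leaves of the DS-tree.
    Returns a list of (input_vector, output) pairs.
    """
    if not 0 < a < b < 1:
        raise ValueError("need 0 < a < b < 1")
    cases = []
    for clause in clauses:
        # First case uses a, second case substitutes b for a.
        for value in (a, b):
            inputs = tuple(value if i in clause else 0 for i in range(1, n + 1))
            cases.append((inputs, value))
    return cases


# E = (x_1 v x_2 v x_3) & (x_1 v x_2 v x_4)
E = [(1, 2, 3), (1, 2, 4)]
for number, case in enumerate(ots_to_ms_cases(E, n=4), start=1):
    print(f"T_{number}: {case}")
```

The loop touches each clause twice and each leaf once per case, so the construction is clearly polynomial in the size of E.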

In  order  to  prove  that  an  instance  of  MS  built 
according  to  the  algorithm  just  given  is  a  yes-instance 
if  and  only  if  the  corresponding  instance  of  OTS  is  a 
yes-instance,  the  following  fact  is  useful. 

Let [p+] denote the probabilistic sum operator, defined as
a [p+] b = a + b - ab. It is easy to show, on the basis of
an observation by Gordon and Shortliffe [1985, section
3.3], that, indicating the belief in d as Bel(d),

Bel(d) = m(c_1)*s_1 [p+] ... [p+] m(c_n)*s_n.^6
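
The formula for Bel(d) is easy to evaluate directly. The sketch below is a minimal Python rendering of it; the function names and the representation of a case input and of the strengths as plain tuples are assumptions made for illustration.

```python
from functools import reduce


def prob_sum(u, v):
    """Probabilistic sum: u [p+] v = u + v - u*v."""
    return u + v - u * v


def belief(inputs, strengths):
    """Bel(d) = m(c_1)*s_1 [p+] ... [p+] m(c_n)*s_n.

    inputs: masses m(c_i) assigned to the leaves (input part of a case).
    strengths: the values s_i attached to the relations c_i => d.
    0 is the identity of the probabilistic sum, so it seeds the fold.
    """
    return reduce(prob_sum, (m * s for m, s in zip(inputs, strengths)), 0.0)


a = 0.25
print(belief((a, a, a, 0), (1, 0, 0, 0)))  # exactly one strength set to 1
print(belief((a, a, a, 0), (1, 1, 0, 0)))  # two strengths set to 1
```

With exactly one relevant strength equal to 1 the computed belief equals the input mass a; with two it is a [p+] a, which is strictly larger than a.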


^6 Each m should have a different subscript, which is dropped
for readability.

[Tree diagram not reproduced. Masses shown in the figure:]
m_1(c_1 => d) = s_1, m_1(Θ) = 1 - s_1
...
m_4(c_4 => d) = s_4, m_4(Θ) = 1 - s_4

T_1: ((a,a,a,0),a)
T_2: ((b,b,b,0),b)
T_3: ((a,a,0,a),a)
T_4: ((b,b,0,b),b)

Figure A Instance of MS corresponding to (x_1 v x_2 v x_3) & (x_1 v x_2 v x_4).

The "if" part is simpler. If variable x_i in the model for
E is true, set s_i to 1. Otherwise, set it to 0. This
ensures that exactly one of the masses corresponding to
each case is 1 and the other two are 0. Therefore, the
computed Bel is equal to the mass. Therefore, each case is
satisfied.

The "only if" part is proved now. Assume that we have a
yes-instance of MS. It will be shown that, in order for an
instance of MS to be a yes-instance, it must be that
exactly one of the s_i corresponding to each case is 1 and
the other two are 0. By assigning true to the variable
corresponding to this single s_i, a model for E is obtained
that satisfies the "one in three" condition. Consider a
generic pair of cases corresponding to a clause in E. We
show, by algebraic manipulation, that this pair is
satisfied if and only if exactly one of the three s_i
corresponding to the cases is 1 and the other two are 0.
Call the strengths x, y, and z. The pair of cases is
satisfied if and only if the following system has a
solution:

ax [p+] ay [p+] az = a
bx [p+] by [p+] bz = b,

i.e., after carrying out the probabilistic sums and
dividing each side by a (respectively, b),

x + y - axy + z - axz - ayz + a^2 xyz = 1
x + y - bxy + z - bxz - byz + b^2 xyz = 1.

If any two of x, y, and z have value 0, the system has a
solution if and only if the other variable has value 1.

To  show  that  the  system  has  no  solution  if  only 
one  of  the  three  variables  is  0,  subtract  the  second 
from  the  first  equation  side  by  side  and  divide  by  (b-a): 

xy  +  xz  +yz  =  (b+a)xyz. 
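
Written out, the subtraction step runs as follows. This is only a worked restatement of the step just described, with the linear terms x + y + z cancelling and b - a nonzero because a < b:

```latex
\begin{align*}
&\bigl(x + y + z - a(xy + xz + yz) + a^2 xyz\bigr)
 - \bigl(x + y + z - b(xy + xz + yz) + b^2 xyz\bigr) = 0 \\
&\Longrightarrow\; (b - a)(xy + xz + yz) = (b^2 - a^2)\,xyz = (b - a)(b + a)\,xyz \\
&\Longrightarrow\; xy + xz + yz = (b + a)\,xyz .
\end{align*}
```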




This  equation  has  no  solution  if  only  one  of  the 
three  variables  is  0. 

The only case left is that in which the three variables are
all positive (and, of course, no greater than 1). In this
case, each of the products xy, xz, and yz is greater than
or equal to xyz, and a + b < 2, so that
xy + xz + yz > 2xyz > (b+a)xyz,
and therefore it is impossible that
xy + xz + yz = (b+a)xyz.

It has been shown that MS is NP-Hard. In order to complete
the proof that MS is NP-Complete, it remains to show that
MS is in NP. A non-deterministic program to solve MS has a
loop whose body assigns (non-deterministically) a value to
each mass and tests whether for that assignment the
function realized by the DS-tree satisfies all cases. Since
the test can be performed in deterministic polynomial time,
the whole program runs in non-deterministic polynomial
time.
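
The guess-and-test structure of this program can be imitated deterministically by trying every assignment in turn. The Python sketch below does so for strengths restricted to the values 0 and 1; the function names, the tolerance used for comparing floating-point beliefs, and the values a = 0.25 and b = 0.75 are assumptions made for the example. The test of a single assignment is polynomial, while the enumeration is exponential in n.

```python
from functools import reduce
from itertools import product


def prob_sum(u, v):
    """Probabilistic sum: u [p+] v = u + v - u*v."""
    return u + v - u * v


def belief(inputs, strengths):
    """Belief in the root d for one case input and one strength assignment."""
    return reduce(prob_sum, (m * s for m, s in zip(inputs, strengths)), 0.0)


def satisfies(strengths, cases, tol=1e-9):
    """Polynomial-time test: does the assignment reproduce every case output?"""
    return all(abs(belief(inp, strengths) - out) <= tol for inp, out in cases)


def solve_ms_01(cases, n):
    """Try all 2**n assignments of 0/1 strengths; return the first that fits."""
    for strengths in product((0, 1), repeat=n):
        if satisfies(strengths, cases):
            return strengths
    return None


a, b = 0.25, 0.75
figure_a_cases = [
    ((a, a, a, 0), a), ((b, b, b, 0), b),  # clause x_1 v x_2 v x_3
    ((a, a, 0, a), a), ((b, b, 0, b), b),  # clause x_1 v x_2 v x_4
]
print(solve_ms_01(figure_a_cases, 4))
```

For the instance of Figure A the first assignment found sets s_3 and s_4 to 1, which corresponds to the model of E in which exactly x_3 is true in the first clause and exactly x_4 in the second.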

(End  of  proof  of  Theorem  1) 

Remark. 

Note that we have not specified the possible values for
s_1,...,s_n in the statement of problem MS. The proof of
NP-Hardness shows that MS is NP-Complete even when the
values are restricted to be in {0,1}. If there are more
than a constant number of different values, MS remains
NP-Hard, but we cannot show it to be NP-Complete.
Similarly, we have not specified the possible values for
the input part of the cases. The proof shows that MS is
NP-Hard even when the values are restricted to a and b,
0<a<b<1. Finally, MS is NP-Complete even if both input and
s_i values are restricted as just outlined at the same
time.